I need to parse an HTML template with DOMDocument, but the HTML may contain PHP code blocks, for example:
<div id="test" data="<?php echo $somevar?>"> </div>
When I load this HTML I get the error "Unescaped '<' not allowed in attributes values...". The parser thinks that the "data" attribute has no closing quote and that <?php starts a new tag. How can I tell it to ignore the <?php tag, or something like that?
Your HTML code:
<div id="test" data="<?php echo $somevar?>"> </div>
is not XML. As XML it is invalid, but as HTML it is okay. To load HTML with DOMDocument, you can use the DOMDocument::loadHTML function.
It will load your template without any error.
Example / Demo:
$html = '<div id="test" data="<?php echo $somevar?>"> </div>';
$doc = new DOMDocument();
$doc->loadHTML($html);
Related: Can PHP include work for only a specified portion of a file?
If you try to parse a document with PHP tags in it, you should remove those, or capture the output of the file first, and then parse it.
You can capture the output of the file with ob_start() and ob_get_clean().
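A minimal sketch of the capture-then-parse idea follows. The helper name render_to_string and the closure standing in for the template are assumptions made for illustration; in a real project the closure body would most likely be an include of the template file.

```php
<?php
// Hypothetical helper (not from the original answer): run $render with output
// buffering on and return everything it printed.
function render_to_string(callable $render): string
{
    ob_start();
    try {
        $render();
    } catch (Throwable $e) {
        ob_end_clean();   // discard partial output before re-throwing
        throw $e;
    }
    return ob_get_clean();
}

$somevar = 'a "quoted" value';

// Stand-in for the template; a real caller would include the file here.
$html = render_to_string(function () use ($somevar) {
    echo '<div id="test" data="' . htmlspecialchars($somevar) . '"> </div>';
});

// By now the PHP has already run, so the parser only sees plain HTML.
$doc = new DOMDocument();
$doc->loadHTML($html);

$div  = $doc->getElementsByTagName('div')->item(0);
$data = $div->getAttribute('data');
echo $data, "\n";
```

Because the PHP blocks are executed before parsing, this only helps when you want to work on the rendered output, not on the template source itself.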
You can remove the PHP tags with regex:
$cleaned = preg_replace("/<\?php.*?\?>/is","",$input);
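As a rough illustration of the regex approach, the sketch below strips both an inline and a multi-line PHP block before parsing. The sample template string is made up for the example. Note the "s" modifier: without it "." does not match newlines, so a block spanning several lines would be left in place.

```php
<?php
// Sample template as a plain string; single quotes keep the PHP tags literal.
$input = '<div id="test" data="<?php echo $somevar?>"> </div>' . "\n"
       . '<?php' . "\n"
       . 'echo $other;' . "\n"
       . '?>' . "\n"
       . '<p>static</p>';

// "s" lets "." match newlines so multi-line blocks are removed too;
// "i" makes the match case-insensitive.
$cleaned = preg_replace('/<\?php.*?\?>/is', '', $input);

$doc = new DOMDocument();
$doc->loadHTML($cleaned);

$div      = $doc->getElementsByTagName('div')->item(0);
$dataAttr = $div->getAttribute('data');   // the PHP block is gone, so this is empty

echo $cleaned, "\n";
```

This is only a sketch: the pattern does not cover short echo tags, a block that is never closed, or a closing tag sequence that appears inside a PHP string literal.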
This feels hacky, but...
$doc->loadHtml(str_replace('<?php', '&lt;?php', file_get_contents($file)));
Try:
<div id="test" data="<?= htmlentities($somevar) ?>"> </div>
You can also try htmlspecialchars(), which is a "lighter" version of htmlentities().
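To make the difference between the two functions concrete, here is a small comparison. The sample value is invented for the example; both calls pass ENT_QUOTES and UTF-8 explicitly so the result does not depend on the PHP version's defaults.

```php
<?php
// "\u{e9}" is e-acute, written as an escape so the file encoding does not matter.
$somevar = "Tom & \"Jerry\" <caf\u{e9}>";

// Escapes only the characters that are special in HTML: & " ' < >
$special = htmlspecialchars($somevar, ENT_QUOTES, 'UTF-8');

// Additionally converts every character that has a named entity.
$entities = htmlentities($somevar, ENT_QUOTES, 'UTF-8');

echo $special, "\n";   // Tom &amp; &quot;Jerry&quot; &lt;café&gt;
echo $entities, "\n";  // Tom &amp; &quot;Jerry&quot; &lt;caf&eacute;&gt;
```

For a value placed inside an attribute, either one keeps quotes and angle brackets from breaking the markup.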
Related
Using PHP and DOMDocument class to parse HTML from TinyMCE editor. I'm having issues inserting <hr /> elements into the editor, because DOMDocument keeps losing the rest of the code.
# Input: <hr /><p> </p><p>test input</p>
$domDoc = new DOMDocument();
$domDoc->loadHTML($input, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
var_dump($domDoc->saveHTML());
// Result: <hr>
I can't find any reason for this, nor an option for loadHTML() to prevent this. What exactly happens and can I use hr element here?
The answer was as follows:
substr($domDoc->saveHTML($domDoc->getElementsByTagName('body')->item(0)), 6, -7)
The issue was located in saveHTML to which I gave the body node and parsed out the tags. Now I get the full HTML out. This is also a one line solution.
It seems that DomDocument has problems when it encounters an HTML string that is not entirely wrapped in a single element. So if you start with:
<h1>My Title</h1><p>My text</p>
then read it into DomDocument and use the DomDocument object to generate the HTML again, you'll get something like:
<h1>My Title<p>My text</p></h1>
For my application the solution was to wrap the entire content in a div before sending it to DomDocument. This fixes the issue posted by the OP - if there is a leading hr tag, wrapping the entire html string in a div will preserve it and the rest of the content.
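The wrapping workaround could look roughly like this. It reuses the input from the question; the variable names and the loop that serializes only the wrapper's children are my own choices, not part of the original answers.

```php
<?php
$input = '<hr /><p> </p><p>test input</p>';

$doc = new DOMDocument();
// The temporary <div> gives the fragment a single root element.
$doc->loadHTML('<div>' . $input . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Serialize the wrapper's children only, so the <div> itself is left out.
$result = '';
foreach ($doc->documentElement->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}

echo $result, "\n";
```

Serializing the children one by one avoids having to cut the wrapper tags off with substr() afterwards.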
I have a situation in which I think I may have to use regex to alter html tag content or src based on the class attribute.
The documents I will be parsing will be either nicely formed HTML, partial HTML or PHP files.
EG I would need to change/fill these tags with inner content: fileX.php
<?php
echo <<<_END
<div class="identifyingClass1"></div>
<div class="identifyingClass2"><span>holding content</span></div>
<img src='http://source.com/to/change' class='identifyingClass3' alt='descrip'/>
_END;
Resulting fileX.php
<?php
echo <<<_END
<div class="identifyingClass1">New content jsd soisvkbsdv</div>
<div class="identifyingClass2">More new content</div>
<img src='new/source.tiff' class='identifyingClass3' alt='descrip'/>
_END;
The html could be complete, could be separated by php, be as is, be inside a hereDOC...
Is the best way to achieve this to just use regex or has anyone seen or used a class for this kind of thing?
Regex is a poor fit for this case. It is better to work on the generated HTML. Here's how you do it.
Enable output buffering and pass your own callback to ob_start(). Process the generated HTML with DOMDocument inside the handler. Something like this:
function my_handler($contents){
    $doc = new DOMDocument();
    // loadHTML() is an instance method; calling it statically fails on PHP 8
    $doc->loadHTML($contents);
    // change your document here and return it later
    return $doc->saveHTML();
}
ob_start('my_handler');
As already stated, regex is not recommended for this kind of thing. Look at this excellent answer. My personal favourite is Simple HTML DOM, which provides a jQuery-like syntax and makes working with HTML in PHP actually joyful ;).
I'm using PHP Simple HTML DOM Parser to get text from a webpage.
The page I need to manipulate is something like:
<html>
<head>
<title>title</title>
<body>
<div id="content">
<h1>HELLO</h1>
Hello, world!
</div>
</body>
</html>
I need to get the h1 element and the text that is not inside any tag.
To get the h1 I use this code:
$html = file_get_html("remote_page.html");
foreach($html->find('#content') as $text){
echo "H1: ".$text->find('h1', 0)->plaintext;
}
But how do I get the other text?
I also tried this inside the foreach, but I get the full text:
$text->plaintext;
It also returned the text of the H1 tag...
It looks like $text->find('text',2); gets what you're looking for, however I'm not sure how well that will work when the amount of text nodes is unknown. I'll keep looking.
You can simply strip html tags using strip_tags
<?php
strip_tags($input, '<br>');
?>
Use strip_tags, as @Peachy pointed out. However, passing <br> as the second argument tells strip_tags to keep <br> tags, which is unnecessary here. In your case,
<?php
strip_tags($text);
?>
would work as you'd like, given that you are only selecting content in the content id.
Try it
echo "H1: ".$text->find('h1', 0)->innertext;
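If switching libraries is an option, the same extraction can be sketched with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. This is a different technique from the answers above, and the inline page string is a stand-in for the remote page from the question.

```php
<?php
$page = <<<HTML
<html>
<head><title>title</title></head>
<body>
<div id="content">
<h1>HELLO</h1>
Hello, world!
</div>
</body>
</html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($page);
$xpath = new DOMXPath($doc);

$h1 = $xpath->query('//div[@id="content"]/h1')->item(0)->textContent;

// text() selects only text nodes that are direct children of the div,
// so the text inside <h1> is not included.
$ownText = '';
foreach ($xpath->query('//div[@id="content"]/text()') as $textNode) {
    $ownText .= $textNode->nodeValue;
}
$ownText = trim($ownText);

echo "H1: $h1\n";        // H1: HELLO
echo "Text: $ownText\n"; // Text: Hello, world!
```

Selecting direct text children works regardless of how many text nodes there are, which sidesteps having to guess an index.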
How can I extract the content between tags with several line breaks?
I'm a newbie to regex, who would like to know how to handle unknown numbers of line break to match my query.
Task: Extract content between <div class="test"> and the first closing </div> tag.
Original source:
<div class="test">optional text<br/>
content<br/>
<br/>
content<br/>
...
content<br/>Hyperlink</div></div></div>
I've worked out the below regex,
/<div class=\"test\">(.*?)<br\/>(.*?)<\/div>/
Just wonder how to match several line breaks using regex.
There is DOM for us but I am not familiar with that.
You should not parse (x)html with regular expressions. Use DOM.
I'm a beginner in XPath, but something like this should work:
//div[@class='test']
This selects all divs with the class 'test'. You will need to load your HTML into a DOMDocument object, then create a DOMXPath object for it, and call its query() method to get the results. It will return a DOMNodeList object.
Final code looks something like this:
$domd = new DOMDocument();
$domd->loadHTML($your_html_code);
$domx = new DOMXPath($domd);
$items = $domx->query("//div[@class='test']");
After this, your div is in $items->item(0).
This is untested code, but if I remember correctly, it should work.
Update, forgot that you need the content.
If you need the text content (no tags), you can simply call $items->item(0)->textContent. If you also need the tags, here's the equivalent of javascript's innerHTML for PHP DOM:
function innerHTML($node){
$doc = new DOMDocument();
foreach ($node->childNodes as $child)
$doc->appendChild($doc->importNode($child, true));
return $doc->saveHTML();
}
Call it with $items->item(0) as the parameter.
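Put together on the markup from the question, the approach might look like the sketch below. The inner_html function is a variant of the helper above that serializes each child with its owning document; its name and the sample string are assumptions for the example.

```php
<?php
// Variant of the helper above: serializes each child with the owning
// document instead of importing the nodes into a new one.
function inner_html(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// The question's markup, including its line breaks.
$source = "<div><div><div class=\"test\">optional text<br/>\n"
        . "content<br/>\n"
        . "<br/>\n"
        . "content<br/>Hyperlink</div></div></div>";

$doc = new DOMDocument();
$doc->loadHTML($source);
$xpath = new DOMXPath($doc);
$div = $xpath->query("//div[@class='test']")->item(0);

$text  = $div->textContent;   // text only, tags dropped
$inner = inner_html($div);    // markup between the opening and closing div

echo $text, "\n";
echo $inner, "\n";
```

Line breaks in the source need no special handling here, because the parser treats them as ordinary whitespace inside the element.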
You could use preg_match_all('/<div class="test">(.*?)<\/div>/si', $html, $matches);. But remember that this will match the first closing </div> within the HTML. Ie. if the HTML looks like <div class="test">...aaa...<div>...bbb...</div>...ccc...</div> then you would get ...aaa...<div>...bbb... as the result in $matches...
So in the end using a DOM parser would indeed be a better solution.
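The nested-div pitfall is easy to reproduce. The sketch below runs the regex and a DOM query on the same sample string from this answer; the variable names are mine.

```php
<?php
$html = '<div class="test">...aaa...<div>...bbb...</div>...ccc...</div>';

// The lazy group stops at the first closing </div>, which belongs to the inner div.
preg_match_all('/<div class="test">(.*?)<\/div>/si', $html, $matches);
$regexResult = $matches[1][0];

// The DOM parser knows which closing tag belongs to which element.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$domResult = $xpath->query("//div[@class='test']")->item(0)->textContent;

echo $regexResult, "\n"; // ...aaa...<div>...bbb...
echo $domResult, "\n";   // ...aaa......bbb......ccc...
```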
I am new to Regex, however I decided it was the easiest route to what I needed to do. Basically I have a string (in PHP) which contains a whole load of HTML code... I want to remove any tags which have style=display:none...
so for example
<img src="" style="display:none" />
<img src="" style="width:11px;display: none" >
etc...
So far my Regex is:
<img.*style=.*display.*:.*none;.* >
But that seems to leave bits of html behind and also take the next element away when used in php with preg_replace.
Like Michael pointed out, you don't want to use Regex for this purpose. A Regex does not know what an element tag is. <foo> is as meaningful as >foo< unless you teach it the difference. Teaching the difference is incredibly tedious though.
DOM is so much more convenient:
$html = <<< HTML
<img src="" style="display:none" />
<IMG src="" style="width:11px;display: none" >
<img src="" style="width:11px" >
HTML;
The above is our (invalid) markup. We feed it to DOM like this:
$dom = new DOMDocument();
$dom->loadHtml($html);
$dom->normalizeDocument();
Now we query the DOM for all "IMG" elements containing a "style" attribute that contains the text "display". We could query for "display: none" in the XPath, but our input markup has occurrences with no space in between:
$xpath = new DOMXPath($dom);
foreach($xpath->query('//img[contains(@style, "display")]') as $node) {
$style = str_replace(' ', '', $node->getAttribute('style'));
if(strpos($style, 'display:none') !== FALSE) {
$node->parentNode->removeChild($node);
}
}
We iterate over the IMG nodes and remove all whitespace from their style attribute content. Then we check if it contains "display:none" and if so, remove the element from the DOM.
Now we only need to save our HTML:
echo $dom->saveHTML();
gives us:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><img src="" style="width:11px"></body></html>
Screw Regex!
Addendum: you might also be interested in Parsing XML documents with CSS selectors
$html = preg_replace("/<img[^>]+style[^>]+none[^>]+>/", '', $html);
Because <img> doesn't allow any other elements inside it, this is possible; but in general, regexp is a thoroughly bad tool for parsing a recursively defined language like HTML.
Anyway, the problem you're probably hitting is that the closing > is being matched by one of the .* expressions, and there happens to be a later > on the line to match your explicit > .
If you replace all your .* by [^>]* that will prevent that. (They probably don't all need to be replaced, but you might as well).
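The effect of that change can be seen on a small made-up input, where a later tag on the same line also ends in " >". The first call uses the pattern from the question, the second the same pattern with every .* replaced by [^>]*.

```php
<?php
$html = '<img src="a.png" style="display:none;" ><p>keep me</p><br >';

// Original pattern: each ".*" can run past ">" to a later " >" on the line.
$greedy = preg_replace('/<img.*style=.*display.*:.*none;.* >/', '', $html);

// Same pattern with ".*" swapped for "[^>]*", which cannot leave the tag.
$bounded = preg_replace('/<img[^>]*style=[^>]*display[^>]*:[^>]*none;[^>]* >/', '', $html);

var_dump($greedy);   // empty string: the whole line was swallowed
var_dump($bounded);  // "<p>keep me</p><br >": only the img tag was removed
```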
Your regular expression is way too broad; .* means "match anything", so this would match:
<img src="foo.png" style="something">Some random displayed text : foo none; bar<br>
At the very least, you probably want to exclude closing brackets from your matches, so [^>]* instead of .*. You also might want to read this, though, and look into using something that actually understands HTML, like DOMDocument
Here is another version which works with all tags including ones with spaces between the inline style display:none or display: none. Plus it deletes the content inside the tags.
$html = preg_replace('/<[^>]+style[^>]+display:\s*none[^>]+>.*?>/', '', $html);
So I have tested it with the following and it works fine.
Only show<div style='display:none'>Delete inside content as well</div> this text.
Only show<span style='display: none'>Delete inside content as well</span> this text.
Only show<div style="display: none">Delete inside content as well</div> this text.
Only show<span style="display:none;">Delete inside content as well</span> this text.
Should now only output.
Only show this text.
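For comparison, the same "remove any hidden element together with its content" task can be sketched with DOM, in the style of the earlier DOMXPath answer. The function name remove_hidden is hypothetical, and the check is deliberately simple: it lowercases the style attribute, strips spaces and looks for display:none.

```php
<?php
// Hypothetical helper: drop every element whose inline style hides it and
// return the text that remains.
function remove_hidden(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // "*" matches any element, not only img.
    foreach ($xpath->query('//*[contains(@style, "display")]') as $node) {
        $style = str_replace(' ', '', strtolower($node->getAttribute('style')));
        if (strpos($style, 'display:none') !== false) {
            // Removing the node also removes everything inside it.
            $node->parentNode->removeChild($node);
        }
    }
    return $dom->documentElement->textContent;
}

$a = remove_hidden("Only show<span style='display: none'>Delete inside content as well</span> this text.");
$b = remove_hidden('Only show<span style="display:none;">Delete inside content as well</span> this text.');
$c = remove_hidden("Only show<div style='display:none'>Delete inside content as well</div> this text.");

echo $a, "\n";
echo $b, "\n";
echo $c, "\n";
```

Unlike the regex, this does not depend on the hidden element having no nested tags, since the whole subtree is removed at once.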