I have a function that creates a preview of a post like this
<?php $pos=strpos($post->content, ' ', 280);
echo substr($post->content,0,$pos ); ?>
But it's possible that the very first thing in that post is a <style> block. How can I add some conditional logic so that my preview shows what comes after the style block?
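If the only HTML content is a <style> tag, you could simply use preg_replace (the s modifier lets . match newlines inside the block, and [^>]* allows attributes on the tag):
echo preg_replace('#<style\b[^>]*>.*?</style>#is', '', $post->content);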
However it is better (and more robust) to use DOMDocument (note that loadHTML will put a <body> tag around your post content and that is what we search for) to output just the text it contains:
$doc = new DOMDocument();
$doc->loadHTML($post->content);
echo $doc->getElementsByTagName('body')->item(0)->nodeValue . "\n";
For this sample input:
$post = (object)['content' => '<style>some random css</style>the text I really want'];
The output of both is
the text I really want
Demo on 3v4l.org
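To tie this back to the preview in the question, here is a minimal sketch that removes the style block with DOMDocument and then cuts the remaining text at the first space after a given length. The function name makePreview and its $minLength parameter are made up for illustration, and the sketch assumes the content is ASCII or declares its own encoding, since loadHTML() otherwise guesses ISO-8859-1.

```php
<?php
// Hypothetical helper (not from the original post): plain-text preview without <style> content.
function makePreview(string $html, int $minLength = 280): string
{
    if (trim($html) === '') {
        return '';                        // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);     // keep warnings about sloppy markup quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Copy the live node list, then drop every <style> element wherever the parser put it.
    foreach (iterator_to_array($doc->getElementsByTagName('style')) as $style) {
        $style->parentNode->removeChild($style);
    }

    $root = $doc->documentElement;
    $text = $root ? trim($root->textContent) : '';

    // Short posts are returned whole; this also keeps the strpos() offset valid.
    if (strlen($text) <= $minLength) {
        return $text;
    }

    // Cut at the first space at or after $minLength so no word is split.
    $pos = strpos($text, ' ', $minLength);
    return $pos === false ? $text : substr($text, 0, $pos);
}

echo makePreview('<style>some random css</style>the text I really want'), "\n";
```

Checking the length before calling strpos() matters because an offset beyond the end of the string raises an error in PHP 8.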
Taking a cue from the excellent comment of @deceze, here's one way to use the DOM with PHP to eliminate the style tags:
<?php
$_POST["content"] =
"<style>
color:blue;
</style>
The rain in Spain lies mainly in the plain ...";
$dom = new DOMDocument;
$dom->loadHTML($_POST["content"]);
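// Copy the live node list first so that replacing nodes does not disturb the iteration.
$style_tags = iterator_to_array($dom->getElementsByTagName('style'));
foreach($style_tags as $style_tag) {
    $parent = $style_tag->parentNode;
    $parent->replaceChild($dom->createTextNode(''), $style_tag);
}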
echo strip_tags($dom->saveHTML());
See demo here
I also took guidance from a related discussion specifically looking at the officially accepted answer.
The advantage of manipulating the HTML through the DOM in PHP is that you don't even need a conditional to remove the STYLE tags. Also, you are working with HTML elements, so you don't have to bother with the intricacies of using a regex. Note that each style tag is replaced by a text node containing an empty string.
Note, tags like HEAD and BODY are automatically inserted when the DOM object executes its saveHTML() method. So, in order to display only text content, the last line uses strip_tags() to remove all HTML tags.
Lastly, while the officially accepted answer is generally a viable alternative, it does not provide a complete solution for non-compliant HTML containing a STYLE tag after a BODY tag.
You have two options.
If there are no tags in your content, use strip_tags().
You could use a regex. This is more complex, but there is always a suitable pattern, e.g. with preg_match().
Related
I would like to find all root-level #text nodes (or those with div parents) which should be wrapped inside a <p> tag. In the following text there should be three (or even just two) final root <p> tags.
<div>
This text should be wrapped in a p tag.
</div>
This also should be wrapped.
<b>And</b> this.
The idea is to format the text nicer so that text blocks are grouped into paragraphs for HTML display. However, the following xpath I have been working out seems to fail to select the text nodes.
<?php
$html = '<div>
This text should be wrapped in a p tag.
</div>
This also should be wrapped.
<b>And</b> this.';
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xpath = '//text()[not(parent::p) and normalize-space()]';
foreach($xp->query($xpath) as $node) {
$element = $dom->createElement('p');
$node->parentNode->replaceChild($element, $node);
$element->appendChild($node);
}
print $dom->saveHTML();
OK, so let me rephrase my comment as an answer. If you want to match all text nodes, you should simply remove the //div part from your XPath expression. So it becomes:
//text()[not(parent::p) and normalize-space()]
Your scenario has many edge cases, and the word "should" adds more on top. I assume you want the classic "a double line break starts a new paragraph" behavior, but this time within a parent <div> (or presumably other block elements) as well.
I would let the HTML parser do most of the work, but I would still use text search and replace (next to XPath). So what follows is a bit hackish, but I think pretty stable:
First of all I would select all text-nodes that are of top-level or child of the said div.
(.|./div)/text()
This xpath is relative to an anchor element which is the <body> tag as it represents the root-tag of your HTML fragment when loaded into DOMDocument.
If child of a div then I would insert the starting paragraph at the very beginning.
Then in any case I would insert a break-mark (here in form of a comment) at each occurrence of the sequence that starts a new paragraph (that should be "\n\n" because of whitespace normalization, I might be wrong and if it doesn't apply, you would need to do the whitespace-normalization upfront to have this working transparently).
/* @var $result DOMText[] */
$result = $xp->query('(.|./div)/text()', $anchor);
foreach ($result as $i => $node)
{
if ($node->parentNode->tagName == 'div')
{
$insertBreakMarkBefore($node, true);
}
while (FALSE !== $pos = strpos($node->data, $paragraphSequence))
{
$node = $node->splitText($pos + $paragraphSequenceLength);
$insertBreakMarkBefore($node);
}
}
These inserted break marks are just there to be replaced with an HTML <p> tag. An HTML parser will turn those into proper <p>...</p> pairs, so I can spare myself writing that algorithm (even though that might be interesting). This basically works like I once outlined in some other answer, but I can't find the link any longer:
After modifying the DOM tree, get the inner HTML of the <body> again.
Replace the set marks with "<p>" (here I mark the class as well to make this visible).
Load the HTML fragment into the parser again to re-create the DOM with the proper <p>...</p> pairs.
Obtain the HTML again from the DOMDocument parser, which is now final.
These outlined steps in code (skipping some of the function definitions for a moment):
$needle = sprintf('%1$s<!--%2$s-->%1$s', $paragraphSequence, $paragraphComment);
$replace = sprintf("\n<p class=\"%s\">\n", $paragraphComment);
$html = strtr($innerHTML($anchor), array($needle . $needle => $replace, $needle => $replace));
echo "HTML afterwards:\n", $innerHTML($loadHTMLFragment($html));
As this shows, double sequences are replaced with a single one. Probably one at the end needs to be deleted as well (if applicable, you could also trim whitespace here).
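The closures $innerHTML and $loadHTMLFragment used above are not defined in this answer. The following is only a guess at what such helpers might look like, written as a minimal sketch; the bodies are assumptions, not the original author's code.

```php
<?php
// Hypothetical helper: serialize the children of a node, i.e. its "inner HTML".
$innerHTML = function (DOMNode $node): string {
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
};

// Hypothetical helper: parse an HTML fragment and return the <body> element
// that loadHTML() wraps around it, to serve as the anchor for relative XPath queries.
$loadHTMLFragment = function (string $html): DOMElement {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // fragments often trigger parser warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    return $doc->getElementsByTagName('body')->item(0);
};

$anchor = $loadHTMLFragment('<div>x</div>');
echo $innerHTML($anchor), "\n";
```

Returning the <body> element rather than the document keeps the HTML and BODY wrappers that the parser adds out of the serialized result.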
The final HTML output:
<div>
<p class="break">
This text should be wrapped in a p tag.
</p>
</div>
<p class="break">
This also should be wrapped.
</p>
<p class="break">
<b>And</b> this.</p>
Some more post-processing for nicely formatted output can be useful, too. Actually, I think it's worth doing, as it will help you tweak the algorithm (Full Demo; as I now see, whitespace normalization probably does not apply there, so use with care).
You can do it with pure JavaScript if you wish:
var content = document.evaluate(
'//text()',
document,
null,
XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
null );
for (var i = 0; i < content.snapshotLength; i++) {
    console.log(content.snapshotItem(i).textContent);
}
I know it is not xpath but check this out:
PHP Simple HTML DOM Parser
http://simplehtmldom.sourceforge.net/
Features
An HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way!
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.
I have a div called
<div id="form">Content</div>
and I want to replace the content of the div with new content using preg_replace.
What regex should be used?
You shouldn't be using a regex at all. HTML can come in many forms, and you would need to take all of them into account. What if the id/class doesn't come in the place you expect? The regex would have to be really complex to give you reasonable results.
Instead, you should use a DOM parser - or a really cool tool I recently stumbled across, phpQuery. With it, you can access your document in PHP almost exactly as you would with jQuery.
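As a rough illustration of the DOM route with PHP's built-in DOMDocument (rather than phpQuery), here is a minimal sketch that swaps the content of the div from the question. The replacement text 'New content' and the variable names are assumptions made for the example.

```php
<?php
$html = '<div id="form">Content</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect markup without warnings
$doc->loadHTML($html);
libxml_clear_errors();

// Locate the div by its id attribute, wherever the attribute appears in the tag.
$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@id="form"]')->item(0);

$result = null;
if ($div !== null) {
    // Remove the old children, then add the new text as a proper text node.
    while ($div->firstChild) {
        $div->removeChild($div->firstChild);
    }
    $div->appendChild($doc->createTextNode('New content'));

    // Serialize only the div, not the HTML and BODY wrappers added by the parser.
    $result = $doc->saveHTML($div);
    echo $result, "\n";
}
```

Using createTextNode() means characters such as < or & in the new text are escaped on output instead of being interpreted as markup.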
This will work in your case:
$html = '<div id="content">Content</div>';
$html = preg_replace('/(<\s*div[^>]*>)[^<]*(<\s*\/div\s*>)/', '$1New Content$2', $html);
echo $html; // <div id="content">New Content</div>
However note that since HTML is not a regular language it is impossible to handle all cases. The simple regex I provided will produce bad output in the following example:
<div class=">">Content</div>
What I need to do is replace all pre tags with code tags.
Example
<pre lang="php">
echo "test";
</pre>
Becomes
<code>
echo "test";
</code>
<pre lang="html4strict">
<div id="test">Hello</div>
</pre>
Becomes
<code>
<div id="test">Hello</div>
</code>
And so on..
PHP's default DOM functions have a lot of problems because of the Greek text inside.
I think Simple HTML DOM Parser is what I need, but I can't figure out how to do what I want. Any ideas?
UPDATE
I'm moving to a new CMS; that's why I'm writing a script to convert all posts to the correct format before inserting them into the DB. I can't use pre tags in the new CMS.
Why not KISS (Keep It Simple, Stupid):
echo str_replace(
array('<pre>', '</pre>'),
array('<code>', '</code>'),
$your_html_with_pre_tags
);
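Note that the plain string replacement above only matches a bare <pre>, so an opening tag with attributes such as <pre lang="php"> is left untouched. A minimal sketch of a variant that also handles attributes, assuming no attribute value contains a > character (the sample input and the $converted variable are made up for the example):

```php
<?php
$your_html_with_pre_tags = '<pre lang="php">echo "test";</pre>';

// The first pattern matches <pre> with or without attributes;
// the second matches the closing tag, allowing optional whitespace.
$converted = preg_replace(
    array('/<pre\b[^>]*>/i', '/<\/pre\s*>/i'),
    array('<code>', '</code>'),
    $your_html_with_pre_tags
);

echo $converted, "\n";
```

This keeps the spirit of the simple approach, but it is still plain text matching and will misfire on unusual markup.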
Look at the manual. Changing <pre> tags to <code> should be as simple as:
$str = '<pre lang="php">
echo "test";
</pre>
<pre lang="html4strict">
<div id="test">Hello</div>
</pre>';
require_once("simplehtmldom/simple_html_dom.php");
$html = str_get_html($str);
foreach($html->find("pre") as $pre) {
$pre->tag = "code";
$pre->lang = null; // remove lang attribute?
}
echo $html->outertext;
// <code>
// echo "test";
// </code>
// <code>
// <div id="test">Hello</div>
// </code>
PS: you should encode the ", < and > characters in your input.
Just replacing pre tags with code tags substantially changes the meaning and the rendering, and it makes the markup invalid if there are any block-level elements like div inside the element. So you need to revise your goal. Check whether you can actually keep using pre. If not, use <div class=pre> instead, together with a style sheet that makes it behave like pre in rendering. When you just replace pre tags with div tags, you won’t create syntax errors (the content model of div allows anything that pre allows, and more).
Regarding the lang attribute, lang="php" is incorrect (per the HTML specs, the lang attribute specifies the human language of the content, using standard language codes), but the idea of encoding information about the computer language is good. It may help styling and scripting later. HTML5 drafts mention that such information can be coded using a class name that starts with language-, e.g. class="language-php" (or, when combined with another class name, class="language-php pre").
I'm looking for a way to count html tags in a chunk of html using php. This may not be a full web page with a doctype body tags etc.
For example:
If I had something like this
$string = "
<div></div>
<div style='blah'></div>
<p>hello</p>
<p>its debbie mcgee
<p class='pants'>missing p above</p>
<div></div>";
I want to pass it to a function with a tag name such as
CheckHtml( $string, 'p' );
and I would like it to tell me the number of open <p> tags and the number of close p tags </p>. I don't want it to do anything fancy beyond that (no sneaky trying to fix it).
I have tried string counts with start tags such as <p, but that can too easily match other tags that begin with the same characters and return wrong results.
I had a look at DOMDocument, but it doesn't seem to count close tags and always expects <html> tags (although I could work around this).
Any suggestions on what to use?
To get an accurate count, you can't use string matching or regex, because of the well-known problems of parsing HTML with regex.
Nor can you use the output of a standard parser, because that's a DOM consisting of elements and all the information about the tags that were in the HTML has been discarded. End tags will be inferred even for valid HTML, and even some start tags (e.g. html, head, body, tbody) can be inferred. Moreover things like the adoption agency algorithm can result in there being more elements than there were tags in the HTML mark-up. For example <b><i></b>x</i> will result in there being two i elements in the DOM. At the same time, end tags that can't be matched with start tags are simply discarded, as indeed can start and end tags that appear in the wrong place. (e.g. <caption> not in <table> or <legend> not in <fieldset>)
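To make the point about inferred end tags concrete, here is a minimal sketch using PHP's DOMDocument on a shortened version of the question's sample. The variable names are made up for the example, and the behavior shown is that of the libxml-based parser PHP ships with.

```php
<?php
// Three <p> start tags, but only two </p> end tags.
$string = "<p>hello</p><p>its debbie mcgee<p class='pants'>missing p above</p>";

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // the input is deliberately sloppy
$doc->loadHTML($string);
libxml_clear_errors();

// Elements the parser built, including the one whose end tag it inferred.
$elementCount = $doc->getElementsByTagName('p')->length;

// Literal end tags present in the source text.
$closeTagCount = substr_count($string, '</p>');

echo "p elements in the DOM: $elementCount\n";
echo "literal </p> tags in the source: $closeTagCount\n";
```

The DOM ends up with one p element per start tag, so nothing in it reveals that one end tag was missing from the source.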
The only way I can think you could do this in any way reliably is this:
There's an open source PHP library for parsing HTML called html5lib.
In there, there's a file called Tokenizer.php and at the end of that file there's a function called emitToken. At this point, the parser has done all the work of figuring out all the HTML weirdnesses¹, and the $token parameter contains all the information about what kind of token has been recognised, including start and end tags.
You could take the library and modify it so that it counts up the start and end tag tokens at that point, and then exposes those totals to your application code at the end of the parse process.
¹: That is, it's figured out the weirdnesses related to your counting problem. It hasn't begun to figure out the tree construction weirdnesses.
You can use substr_count() to return the number of times the needle substring occurs in the haystack $string.
$open_tag_count = substr_count( $string, '<p' );
$close_tag_count = substr_count( $string, '</p>' );
Be aware that '<p' also matches <param and <pre, so you may need to modify your search to handle two specific cases:
$open_tag_count_without_attributes = substr_count( $string, '<p>' );
$open_tag_count_with_attributes = substr_count( $string, '<p ' );
$open_tag_count = $open_tag_count_without_attributes + $open_tag_count_with_attributes;
You may also wish to consider using preg_match_all(). Using a regular expression to parse HTML comes with a fairly substantial set of pitfalls, so use with caution.
substr_count seems like a good bet.
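EDIT: You'll have to use preg_match_all then
I haven't tested this, but to give you an idea:
function checkHTML($string,$htmlTag){
    // preg_match_all() returns the number of matches; preg_match() stops after the first one.
    $tag = preg_quote($htmlTag, '/');
    $openTags = preg_match_all('/<'.$tag.'\b[^>]*>/i',$string);
    $closeTags = preg_match_all('/<\/'.$tag.'\s*>/i',$string);
    return array($openTags, $closeTags);
}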
$numberOfParagraphTags = checkHTML($string,'p');
echo('Open Tags:'.$numberOfParagraphTags[0].' Close Tags:'.$numberOfParagraphTags[1]);
For the chunk of HTML, try using the DOMDocument PHP class instead of a string. Then you can use methods such as getElementsByTagName(), which will let you count the elements more easily and more accurately. To load your string into a DOMDocument, you could do something like this:
$doc = new DOMDocument();
$doc->loadHTML($string);
Then, to count your tags, do the following:
$tagList = $doc->getElementsByTagName($tag);
return $tagList->length;
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
I need to do some regex replacement on HTML input, but I need to exclude some parts from filtering by other regexp.
(e.g. remove all <a> tags with specific href="example.com…, except the ones that are inside the <form> tag)
Is there any smart regex technique for this? Or do I have to find all forms using $regex1, then split the input to the smaller chunks, excluding the matched text blocks, and then run the $regex2 on all the chunks?
The NON-regexp way:
<?php
$html = '<html><body>a <b>bold</b> foz b c <form>l</form> a</body></html>';
$d = new DOMDocument();
$d->loadHTML($html);
$x = new DOMXPath($d);
$elements = $x->query('//a[not(ancestor::form) and @href="foo"]');
foreach($elements as $elm){
//run if contents of <a> should be visible:
while($elm->firstChild){
$elm->parentNode->insertBefore($elm->firstChild,$elm);
}
//remove a
$elm->parentNode->removeChild($elm);
}
var_dump($d->saveXML());
?>
Why can't you just dump the html string you need into a DOM helper, then use getElementsByTagName('a') to grab all anchors and use getAttribute to get the href, removeChild to remove it?
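A minimal sketch of that idea, assuming the goal from the question (remove anchors whose href starts with example.com unless they sit inside a form). The sample markup and the $insideForm helper are made up for illustration.

```php
<?php
$html = '<p><a href="example.com/x">outside</a></p>'
      . '<form><a href="example.com/y">inside</a></form>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Hypothetical helper: walk up the tree to see whether a node sits inside a <form>.
$insideForm = function (DOMNode $node): bool {
    for ($p = $node->parentNode; $p !== null; $p = $p->parentNode) {
        if ($p->nodeName === 'form') {
            return true;
        }
    }
    return false;
};

// Snapshot the live list first; removing nodes while iterating it skips entries.
foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
    $matches = strpos($a->getAttribute('href'), 'example.com') === 0;
    if ($matches && !$insideForm($a)) {
        $a->parentNode->removeChild($a);
    }
}

$remaining = $doc->getElementsByTagName('a');
echo $remaining->length, "\n";
```

Only the anchor inside the form survives; the one in the paragraph is removed together with its text.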
This looks like PHP, right? http://htmlpurifier.org/
Any particular reason you would want to do that with regular expressions? It sounds like it would be fairly straightforward in JavaScript to spin through the DOM and do it that way.
In jQuery, for instance, it seems like you could do this in just a couple lines using its DOM selectors.
If forms can be nested, it is technically impossible.
If forms can not be nested, it is practically impossible. There is no function where you can use the same regex to
define an area where the matching should be done (i.e. outside form)
define things to be matched (i.e. elements)