PHP: html tidy repair string: making it not encase everything in <html>


Using the following code:
$tidy = new tidy();
$clean = $tidy->repairString("<p>Hello</p>");
This encases the string in the whole shenanigans:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<p>Hello</p>
</body>
</html>
Since I'm using it on a "description" field that contains some HTML tags from time to time, I just want to use it to fix anomalies in the string, for example unclosed elements, elements that are closed but never opened, and so on, not to wrap it in a full HTML document.
If the string doesn't contain any HTML at all, it should just return the input. And if it contains HTML like the example above, it should fix whatever there is to fix (which is nothing in this example) and not wrap it in a full document.
Does anyone know how to make HTML Tidy not wrap the string like this?

I was struggling with the same problem, but found the answer in the Tidy documentation. If you add 'show-body-only' => true to the configuration, Tidy outputs only the body content, without the doctype and the surrounding html, head and body elements.
$tidy = new tidy();
$input = "<p>A paragraph with <b>bold<b> text";
$clean = $tidy->repairString($input,array('show-body-only' => true));
echo $clean;
This will show:
<p>A paragraph with <b>bold</b> text</p>
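The two requirements from the question (return plain text untouched, repair fragments without a document wrapper) can be combined in a small helper. This is only a sketch: the function name repair_fragment, the strip_tags() check for "no HTML at all", the 'utf8' encoding argument and the fallback on failure are assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: repair an HTML fragment without wrapping it in a document.
function repair_fragment(string $input): string
{
    // No markup at all: return the input untouched, as the question asks.
    if ($input === strip_tags($input)) {
        return $input;
    }
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The tidy extension is not available.');
    }
    $tidy = new tidy();
    // 'show-body-only' keeps Tidy from emitting the doctype, <html>, <head> and <body>.
    $clean = $tidy->repairString($input, ['show-body-only' => true], 'utf8');
    // Assumption: fall back to the original string if Tidy reports failure.
    return $clean === false ? $input : $clean;
}
```

Calling repair_fragment('Just text') returns the string unchanged, while a fragment with an unclosed tag goes through Tidy and comes back without the surrounding document.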

Related

Search and replace a string of HTML using the PHP DOM Parser

How can I search and replace a specific string (text + html tags) in a web page using the native PHP DOM Parser?
For example, search for
<p> Check this site </p>
This string is somewhere inside an HTML tree.
I would like to find it and replace it with another string. For example,
<span class="highlighter"><p> Check this site </p></span>
Bear in mind that there is no ID to the <p> or <a> nodes. There can be many of those identical nodes, holding different pieces of text.
I tried str_replace, however it fails with complex html markup, so I have turned to HTML Parsers now.
EDIT:
The string to be found and replaced might contain a variety of HTML tags, like divs, headings, bold text, etc. So I am looking for a solution that can construct a regex or DOM XPath query depending on the contents of the string being searched.
Thanks!

Is this what you wanted:
<?php
// load
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
// search p elements
$p_elements = $doc->getElementsByTagName('p');
// process these elements, if available
if (!is_null($p_elements))
{
    foreach ($p_elements as $p_element)
    {
        // get p element nodes
        $nodes = $p_element->childNodes;
        // check for "a" nodes in these nodes
        foreach ($nodes as $node) {
            // found an a node - check must be defined better!
            if (strtolower($node->nodeName) === 'a')
            {
                // create the new span element
                $span_element = $doc->createElement('span');
                $span_element->setAttribute('class', 'highlighter');
                // replace the "p" element with the span
                $p_element->parentNode->replaceChild($span_element, $p_element);
                // append the "p" element to the span
                $span_element->appendChild($p_element);
            }
        }
    }
}
// output
echo '<pre>';
echo htmlentities($doc->saveHTML());
echo '</pre>';
This HTML is the basis for conversion:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<p> Check this site </p>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><p> Check this site </p>
</body></html>
The output looks like this; it wraps the elements you mentioned:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Your Title Here</title></head><body bgcolor="FFFFFF">
<hr>Link Name
is a link to another nifty site
<h1>This is a Header</h1>
<h2>This is a Medium Header</h2>
<span class="highlighter"><p> Check this site </p></span>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<p> This is a new paragraph!
</p><hr><span class="highlighter"><p> Check this site </p></span>
</body></html>
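The same wrapping can be sketched with DOMXPath, which selects the matching paragraphs in a single query instead of nested loops. The function name wrap_linked_paragraphs, the XPath expression //p[a] and the example.com URL are illustrative assumptions; adjust the expression to whatever identifies the nodes you are looking for.

```php
<?php
// Hypothetical helper: wrap every <p> that directly contains an <a> in a span.
function wrap_linked_paragraphs(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // The XPath result is a snapshot, so the tree can be modified while iterating.
    foreach ($xpath->query('//p[a]') as $p) {
        $span = $doc->createElement('span');
        $span->setAttribute('class', 'highlighter');
        $p->parentNode->replaceChild($span, $p); // put the span where the <p> was
        $span->appendChild($p);                  // then move the <p> inside it
    }
    return $doc->saveHTML();
}

echo wrap_linked_paragraphs(
    '<html><body><p><a href="http://example.com">Check this site</a></p>'
    . '<p>This is a new paragraph!</p></body></html>'
);
```

Only the first paragraph is wrapped here, because the second one contains no link.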

You could use a regular expression with preg_replace.
preg_replace("/<\s*p[^>]*>(.*?)<\s*\/\s*p>/", '<span class="highligher"><p>$1</p></span>', '<p> Check this site</p>');
The fourth parameter of preg_replace ($limit) can be used to restrict the number of replacements; the third parameter is the subject string.
http://php.net/manual/en/function.preg-replace.php
http://www.pagecolumn.com/tool/all_about_html_tags.htm - for more examples on regular expressions for HTML
You will need to edit the regular expression to only capture the p tags with the google href
EDIT
preg_replace("/<\s*\w.*?><a href\s*=\s*\"?\s*(.*)(google.com)\s*\">(.*?)<\/a>\s*<\/\s*\w.*?>/", '<span class="highligher"><p>$3</p></span>', $string);
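A small sketch of restricting the number of replacements: in preg_replace the limit is passed as the fourth argument, after the subject string. The simplified pattern and the sample markup are assumptions chosen to keep the example short.

```php
<?php
$html = '<p>first</p><p>second</p>';

// Arguments: pattern, replacement, subject, limit.
$result = preg_replace(
    '/<p>(.*?)<\/p>/s',
    '<span class="highlighter"><p>$1</p></span>',
    $html,
    1 // replace only the first match
);

echo $result;
```

With a limit of 1, only the first paragraph is wrapped and the second one is left as it was.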

Scrape an H1 element on the current page in PHP

I'm currently working with WordPress. I have a hook that runs before the <title> element is populated with text that a user enters in the dashboard.
Now I want the default title of each page to equal the text of an <h1> element on the current page. A fragment of the callback function for the hook I'm working with looks like this:
if (!$seoTitle) {
$seoTitle = '<....>';
}
return $seoTitle;
I want seoTitle to default to an <h1> element text on the current page. Is it doable? How can I achieve this?

I'm not totally sure how you get your HTML but you could parse it with the built in DOM parser.
<?php
$html = "<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading one</h1>
<p>This is a paragraph.</p>
<h1>This is a Heading two</h1>
<p>This is a paragraph.</p>
<h1>This is a Heading three</h1>
<p><a href='testwww'> This is a paragraph.</a></p>
</body>
</html>";
$dom = new DOMDocument();
$dom->loadHTML($html);
//If you want to get it from a website you could do the following:
//$dom->loadHTML(file_get_contents('http://www.w3schools.com/'));
// iterate through the html to get all h1 text
foreach ($dom->getElementsByTagName('h1') as $heading) {
    $h1 = $heading->nodeValue;
    echo $h1 . "<br>";
}
?>
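To plug this into the callback from the question, the loop can be reduced to a helper that returns only the first heading. This is a sketch under assumptions: the name first_h1_text is made up, and how the page's HTML reaches the callback is not covered by the question, so a literal string stands in for it.

```php
<?php
// Hypothetical helper: text of the first <h1> in $html, or null if there is none.
function first_h1_text(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() does not accept an empty string
    }
    $doc = new DOMDocument();
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $h1 = $doc->getElementsByTagName('h1')->item(0);
    return $h1 === null ? null : trim($h1->textContent);
}

// Stand-in for the hook callback: only fill the title when none was entered.
$seoTitle = '';
if (!$seoTitle) {
    $seoTitle = first_h1_text('<html><body><h1>This is a Heading one</h1></body></html>') ?? '';
}
echo $seoTitle;
```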

Assuming you have your HTML content in a variable and run this after the page has been fully generated, take a look at the example below:
<?php
$htmlContent = '<html><body><h1>HELLO</h1></body></html>'; // change this to what you need
$seoTitle = preg_replace('/(.*)<h1>([^>]*)<\/h1>(.*)/is', '$2', $htmlContent);
echo $seoTitle; // will output: HELLO
?>

echo "<h1>".(string)$seoTitle."</h1>";
This should work. You can also break out of PHP with ?>, type regular HTML, and then re-enter PHP when you want to echo the variable.

string's result is different after load in domdocument

I want to get the same result after loading the string into DOMDocument. How can I do it?
echo "Café";
$s = <<<HTML
<html>
<head>
</head>
<body>
Café
</body>
</html>
HTML;
$d = new domdocument;
$d->loadHTML($s);
echo $d->textContent;
The first echo's result is: Café
The second echo's result is: CafÃ©

You need to mark your HTML as UTF-8 encoded:
$s = <<<HTML
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
Café
</body>
</html>
HTML;
$d = new domdocument;
$d->loadHTML($s);
echo $d->textContent;
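When the input is only a fragment rather than a full page, the same meta tag can be added by wrapping the fragment before loading it. A minimal sketch, assuming the fragment is UTF-8 encoded; the helper name utf8_fragment_text is made up for illustration.

```php
<?php
// Hypothetical helper: text content of a UTF-8 encoded HTML fragment.
function utf8_fragment_text(string $fragment): string
{
    // Same meta tag as in the answer above, so the bytes are decoded as UTF-8.
    $html = '<html><head>'
        . '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
        . '</head><body>' . $fragment . '</body></html>';

    $d = new DOMDocument();
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $d->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return trim($d->documentElement->textContent);
}

echo utf8_fragment_text('<p>Café</p>');
```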

Your problem is encoding.
The first echo outputs the text in your default encoding,
but in the text rendered through DOMDocument
the é is split into two characters.
I don't know how to enforce the right encoding in DOMDocument,
but I am sure this is your problem.
Hope I helped,
best of luck.

With the first echo, before the HTML, you send headers with your server's default encoding, and any encoding set afterwards is ignored.
You must first echo
the <html> tag, the encoding declaration and so on,
and only then echo any other values.

Adding style tags to head with PHP DOMDocument

I want to create and add a set of <style> tags to the head tags of an HTML document.
I know I can start out like this:
$url_contents = file_get_contents('http://example.com');
$dom = new DOMDocument;
$dom->loadHTML($url_contents);
$new_elm = $dom->createElement('style', 'css goes here');
$elm_type_attr = $dom->createAttribute('type');
$elm_type_attr->value = 'text/css';
$new_elm->appendChild($elm_type_attr);
Now, I also know that I can add the new style tags to the HTML like this:
$dom->appendChild($new_elm);
$dom->saveHTML();
However, this would create the following scenario:
<html>
<!--Lots of HTML here-->
</html><style type="text/css">css goes here</style>
The above is essentially pointless; the CSS is not parsed and just sits there.
I found this solution online (obviously didn't work):
$head = $dom->getElementsByTagName('head');
$head->appendChild($new_elm);
$dom->saveHTML();
Thanks for the help!!
EDIT:
Is it possible?

getElementsByTagName returns a list of nodes (a DOMNodeList), not a single node, so try
$head->item(0)->appendChild($new_elm);

$head = $dom->getElementsByTagName('head');
returns a DOMNodeList. It is better to get the first element like this:
$head = $dom->getElementsByTagName('head')->item(0);
Now $head is a DOMNode object, so you can call its appendChild method.

This is the solution that worked for me
// Create new <style> tag containing given CSS
$new_elm = $dom->createElement('style', 'css goes here');
$new_elm->setAttribute('type', 'text/css');
// Inject the new <style> Tag in the document head
$head = $dom->getElementsByTagName('head')->item(0);
$head->appendChild($new_elm);
You can also add this line at the end to get clean indentation:
// Add a line break between </style> and </head> (optional)
$head->insertBefore($dom->createTextNode("\n"));
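Put together as a self-contained sketch, the steps above look like this. The function name add_style_to_head, the decision to return the input unchanged when no head element exists, and the use of a text node for the CSS (instead of the value argument of createElement) are assumptions made for this example.

```php
<?php
// Hypothetical helper: append a <style> element to the document's <head>.
function add_style_to_head(string $html, string $css): string
{
    $dom = new DOMDocument();
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) yields a single node (or null), not the whole DOMNodeList.
    $head = $dom->getElementsByTagName('head')->item(0);
    if ($head === null) {
        // Assumption: a document without <head> is returned unchanged.
        return $html;
    }

    $style = $dom->createElement('style');
    $style->setAttribute('type', 'text/css');
    // A text node sidesteps the escaping rules of createElement's value argument.
    $style->appendChild($dom->createTextNode($css));
    $head->appendChild($style);

    return $dom->saveHTML();
}

echo add_style_to_head(
    '<html><head><title>T</title></head><body><p>x</p></body></html>',
    'body { color: red; }'
);
```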

Fixing unclosed HTML tags

I am working on a blog layout and I need to create an abstract of each post (say, the 15 latest) to show on the homepage. The content I use is already formatted with HTML tags by the Textile library. If I use substr to get the first 500 chars of the post, the main problem I face is how to close the unclosed tags.
e.g.
<div>.......................</div>
<div>...........
<p>............</p>
<p>...........| 500 chars
</p>
</div>
What I get is two unclosed tags, <p> and <div>. The p won't cause much trouble, but the div messes up the whole page layout. Any suggestions on how to track the opening tags and close them manually, or something similar?

There are lots of methods that can be used:
Use a proper HTML parser, like DOMDocument
Use PHP Tidy to repair the un-closed tag
Some would suggest HTML Purifier

As ajreal said, DOMDocument is a solution.
Example :
$str = "
<html>
<head>
<title>test</title>
</head>
<body>
<p>error</i>
</body>
</html>
";
$doc = new DOMDocument();
// @ suppresses the warnings that malformed HTML triggers
@$doc->loadHTML($str);
echo $doc->saveHTML();
Advantage : natively included in PHP, contrary to PHP Tidy.

You can use DOMDocument to do it, but be careful of string encoding issues. Also, you'll have to use a complete HTML document, then extract the components you want. Here's an example:
function make_excerpt ($rawHtml, $length = 500) {
    // append an ellipsis and "More" link
    $content = substr($rawHtml, 0, $length)
        . '… More >';
    // Detect the string encoding
    $encoding = mb_detect_encoding($content);
    // pass it to the DOMDocument constructor
    $doc = new DOMDocument('', $encoding);
    // Must include the content-type/charset meta tag with $encoding
    // Bad HTML will trigger warnings, suppress those
    @$doc->loadHTML('<html><head>'
        . '<meta http-equiv="content-type" content="text/html; charset='
        . $encoding . '"></head><body>' . trim($content) . '</body></html>');
    // extract the components we want
    $nodes = $doc->getElementsByTagName('body')->item(0)->childNodes;
    $html = '';
    $len = $nodes->length;
    for ($i = 0; $i < $len; $i++) {
        $html .= $doc->saveHTML($nodes->item($i));
    }
    return $html;
}
$html = "<p>.......................</p>
<p>...........
<p>............</p>
<p>...........| 500 chars";
// output fixed html
echo make_excerpt($html, 500);
Outputs:
<p>.......................</p>
<p>...........
</p>
<p>............</p>
<p>...........| 500 chars… More ></p>
If you are using WordPress you should wrap the substr() invocation in a call to wpautop - wpautop(substr(...)). You may also wish to test the length of the $rawHtml passed to the function, and skip appending the "More" link if it isn't long enough.
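The length test suggested above could look like the following sketch. The helper name truncate_for_excerpt is hypothetical; it would replace the substr() expression at the top of make_excerpt. Note that substr() counts bytes, so a multibyte character at the cut point can be split.

```php
<?php
// Hypothetical helper: cut $rawHtml to $length bytes and add the "More" marker
// only when something was actually cut off.
function truncate_for_excerpt(string $rawHtml, int $length = 500): string
{
    if (strlen($rawHtml) <= $length) {
        return $rawHtml; // short enough: nothing to truncate, no "More" marker
    }
    return substr($rawHtml, 0, $length) . '… More >';
}
```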
