Dueling Regex and DOM Scripts

I'm learning how to work with Bootstrap and jQuery. It looks like I need to do something like this with my articles in order to get some of the special effects (like toggling sections open and closed)...
<section id="introduction">
<h2 class="h2Article" id="a1" data-toggle="collapse" data-target="#b1"><span class="Article"><span class="label label-primary"><small><span class="only-collapsed glyphicon glyphicon-chevron-down"></span><span class="only-expanded glyphicon glyphicon-remove-sign"></span></small> Introduction</span></span></h2>
<div class="divArticle collapse in article" id="b1">
But I have hundreds of articles to put into my database, and writing all that code would take forever. So I put this in my database instead...
<section id="introduction">
<h2 class="h2Article">Introduction</h2>
<div class="divArticle">
Next, I want to use regex or DOM to insert the missing values (preferably regex, as it's easier to work with). In fact, it was working, but now it isn't.
This is my regex script:
$Content = preg_replace('/<h2 class="h2Article">(.*?)<\/h2>/', '<h2 class="h2Article"><span class="Article"><span class="label label-primary"> <small><span class="only-collapsed glyphicon glyphicon-chevron-down"> </span><span class="only-expanded glyphicon glyphicon-remove-sign"></span> </small> $1</span></h2>', $Content);
$Content = preg_replace('/<h3 class="h3Article" id="(.*?)">(.*?) <\/h3>/', '<h3 id="$1" class="Article"><span class="label label-default">$2</span></h3>', $Content);
$Content = preg_replace('/<div class="divArticle">(.*?)<\/div>/', '<div class="divArticle">$1<div style="margin-bottom: 10px; font-size: 150%; text-align: center;"><span class="only-expanded glyphicon glyphicon-remove-sign"></span></div></div>', $Content);
$Content = str_replace('<!-- EndMainDiv -->', '<div class="divClose" style="margin-bottom: 10px; font-size: 150%; text-align: center;"><span class="only-expanded glyphicon glyphicon-remove-sign"></span></div><!-- EndMainDiv -->', $Content);
And this is what the resulting HTML looks like:
<section id="introduction">
<h2 class="h2Article"><span class="Article"><span class="label label-primary"><small><span class="only-collapsed glyphicon glyphicon-chevron-down"></span><span class="only-expanded glyphicon glyphicon-remove-sign"> </span></small> Introduction</span></h2>
<div class="divArticle">
This is my DOM script:
$i = 1; // initialize counter
// initialize DOMDocument
$dom = new DOMDocument;
@$dom->loadHTML($Content); // load the markup (@ suppresses parser warnings)
$sections = $dom->getElementsByTagName('section'); // get all section tags
if($sections->length > 0) { // if there are indeed section tags inside
// work on each section
foreach($sections as $section) { // for each section tag
// $section->setAttribute('id', '#a' . $i); // set id for section tag
// get div inside each section
foreach($section->getElementsByTagName('h2') as $h2) {
if($h2->getAttribute('class') == 'h2Article') { // if this h2 has class h2Article
$h2->setAttribute('id', 'a' . $i); // set id for h2 tag
$h2->setAttribute('data-target', '#b' . $i);
// $h2->setAttribute('data-target', '#b' . $i . ',#c' . $i);
}
}
foreach($section->getElementsByTagName('div') as $div) {
if($div->getAttribute('class') == 'divArticle') { // if this div has class divArticle
$div->setAttribute('id', 'b' . $i); // set id for div tag
}
if($div->getAttribute('class') == 'divClose') { // if this div has class divClose
$div->setAttribute('data-target', '#b' . $i); // point the close button at the collapsible div
}
}
$i++; // increment counter
}
}
// back to string again, get all contents inside body
$Content = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
$Content .= $dom->saveHTML($child); // convert to string and append to the container
}
$Content = str_replace('data-target', 'data-toggle="collapse" data-target', $Content);
// $Content = str_replace('data-target', 'class="SecCon" data-toggle="collapse" data-target', $Content);
$Content = str_replace('<div class="divArticle', '<div class="divArticle collapse in article', $Content);
And this is what the HTML looks like:
<section id="introduction">
<h2 class="h2Article" id="a1" data-toggle="collapse" data-target="#b1">Introduction</h2>
<div class="divArticle collapse in article" id="b1">
The weird thing is, it works when I use the regex and DOM script BOTH, but that's going to be kind of sloppy to work with. Can anyone tell me how to modify either my regex or my DOM to make it do the job by itself?
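One possible DOM-only approach is sketched below. The regex pass supplies the wrapper spans and the DOM pass supplies the ids and data attributes, which is why both scripts are currently needed; DOMDocument can create the wrapper spans as well. The function name decorateArticle() and the wrapper id "wrap" are assumptions made for this sketch, and the h3 and divClose handling from the regex script is left out for brevity.

```php
<?php
// Hypothetical helper; the name decorateArticle() is an assumption, not part of the post.
function decorateArticle(string $content): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // <section> triggers warnings in libxml's HTML parser
    // Wrap the fragment so it can be pulled back out by id afterwards.
    $dom->loadHTML('<div id="wrap">' . $content . '</div>');
    libxml_clear_errors();

    $icons = [
        'only-collapsed glyphicon glyphicon-chevron-down',
        'only-expanded glyphicon glyphicon-remove-sign',
    ];
    $i = 1;
    foreach ($dom->getElementsByTagName('section') as $section) {
        foreach ($section->getElementsByTagName('h2') as $h2) {
            if ($h2->getAttribute('class') !== 'h2Article') {
                continue;
            }
            $h2->setAttribute('id', 'a' . $i);
            $h2->setAttribute('data-toggle', 'collapse');
            $h2->setAttribute('data-target', '#b' . $i);

            // Build: span.Article > span.label > (small > icons) + original heading text
            $outer = $dom->createElement('span');
            $outer->setAttribute('class', 'Article');
            $label = $dom->createElement('span');
            $label->setAttribute('class', 'label label-primary');
            $small = $dom->createElement('small');
            foreach ($icons as $class) {
                $icon = $dom->createElement('span');
                $icon->setAttribute('class', $class);
                $small->appendChild($icon);
            }
            $label->appendChild($small);
            $label->appendChild($dom->createTextNode(' '));
            while ($h2->firstChild) {   // move the original heading content into the label
                $label->appendChild($h2->firstChild);
            }
            $outer->appendChild($label);
            $h2->appendChild($outer);
        }
        // Copy the live node list before modifying the divs.
        foreach (iterator_to_array($section->getElementsByTagName('div')) as $div) {
            if ($div->getAttribute('class') === 'divArticle') {
                $div->setAttribute('class', 'divArticle collapse in article');
                $div->setAttribute('id', 'b' . $i);
            }
        }
        $i++;
    }

    // Back to a string: serialize everything inside the wrapper.
    $out = '';
    foreach ($dom->getElementById('wrap')->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

With something like this, $Content = decorateArticle($Content); would replace both passes for the h2 and divArticle elements. Setting the class and data-toggle attributes directly also makes the trailing str_replace calls unnecessary.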

Related

Replace certain Child value if doesn't contain certain string? or Rewrite XPATH query? Website scrape

Preface: This is the first XPath and DOM script I have ever worked on.
The following code works, to a point.
If the child->nodevalue, which should be price, is empty it throws off the rest of the elements and it just snowballs from there. I have spent hours reading, rewriting and can't come up with a way to fix it.
I am at the point where I think my XPath query could be the issue because I am out of ideas on how to test that is the right child value.
The content I am scraping looks like this (actually, it looks nothing like this: there are 148 lines of HTML for each product, but these are the relevant ones):
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$1,000,000
</span>
</div>
Here is the code I am using.
$html =file_get_contents('http://localhost:8888/scraper/source.html');
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xpath = new \DOMXpath($doc);
$xpath->preserveWhiteSpace = FALSE;
$nodes= $xpath->query("//a[@class = 'a-link-normal s-no-outline'] | //span[@class = 'a-size-base-plus a-color-base a-text-normal'] | //span[@class = 'a-price']");
$data =[];
foreach ($nodes as $node) {
$url = $node->getAttribute('href');
if(trim($url,"\xc2\xa0 \n \t \r") != ''){
array_push($data,$url);
}
foreach ($node->childNodes as $child) {
if (trim($child->nodeValue, "\xc2\xa0 \n \t \r") != '') {
array_push($data, $child->nodeValue);
}
}
}
$chunks = (array_chunk($data, 4));
foreach($chunks as $chunk) {
$newarray = [
'url' => $chunk[0],
'title' => $chunk[1],
'todaysprice' => $chunk[2],
'hiddenprice' => $chunk[3]
];
echo '<p>' . $newarray['url'] . '<br>' . $newarray['title'] . '<br>' .
$newarray['todaysprice'] . '</p>';
}
Outputs:
URL
Title
Price
URL
Title
Price
URL
Title
URL. <---- "Price was missing so it used the next child node value and now everything from here down is wrong."
Title
Price
URL
I am aware this code is far from right, but I had to start somewhere.

If I understand you correctly, you are probably looking for something like the below. For the sake of simplicity, I skipped the array building parts, and just echoed the target data.
So assume your html looks like the one below:
$html = '
<body>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$1,000,000
</span>
</div>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed2.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The other Title I Need
</span>
</a>
</h2>
</div>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed3.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Final Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$2,000,000
</span>
</div>
</body>
';
Try this:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$data = $xpath->query('//h2[@class="second class"]');
foreach($data as $datum){
echo trim($xpath->query('.//a/@href',$datum)[0]->nodeValue),"\r\n";
echo trim($xpath->query('.//a/span',$datum)[0]->nodeValue),"\r\n";
#$price = $xpath->query('./following-sibling::span',$datum);
#EDITED
$price = $xpath->query('./following-sibling::span[@class="a-offscreen"]',$datum);
if ($price->length>0) {
echo trim($price[0]->nodeValue), "\r\n";
} else {
echo("No Price"),"\r\n";
}
echo "\r\n";
};
Output:
TheURLINeed.php
The Title I Need
$1,000,000
TheURLINeed2.php
The other Title I Need
No Price
TheURLINeed3.php
The Final Title I Need
$2,000,000
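The array-building part that was skipped could look roughly like this. It is only a sketch: the function name extractProducts() and the array keys are assumptions, and a missing price is stored as null so that one product can never shift into the next.

```php
<?php
// Hypothetical helper; extractProducts() is not part of the original answer.
function extractProducts(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world markup is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $products = [];
    // One record per product heading; every lookup is relative to that heading.
    foreach ($xpath->query('//h2[@class="second class"]') as $h2) {
        $link  = $xpath->query('.//a', $h2)->item(0);
        $title = $xpath->query('.//a/span', $h2)->item(0);
        $price = $xpath->query('./following-sibling::span[@class="a-offscreen"]', $h2)->item(0);

        $products[] = [
            'url'   => $link ? $link->getAttribute('href') : null,
            'title' => $title ? trim($title->textContent) : null,
            'price' => $price ? trim($price->textContent) : null,   // null when absent
        ];
    }
    return $products;
}
```

Because each record is assembled from its own h2, there is no need for array_chunk, which is what let a missing price throw off every later row.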

how to remove link from simple dom html data

I have this code. It gets the info I want, but the output also includes a link:
require_once('simple_html_dom.php');
set_time_limit (0);
$url ='www.domain.com';
$html = file_get_html($url);
// read the first div
foreach($html->find('#content') as $element){
// read the paragraphs inside it
foreach ($element->find('p') as $phone){
echo $phone;
}
}
For example, it prints:
Mobile Pixel 2 -
google << this is the link
But I need to remove these links. The problem is that I am scraping this:
<p>the info that i really need is here<p>
<p class="text-right"><a class="btn btn-default espbott aplus" role="button"
href="brand/google.html">Google</a></p>
I read this:
Simple HTML Dom: How to remove elements?
But I can't find the answer there.
Update: if I use this:
foreach ($element->find('p[class="text-right"]') as $phone)
it selects the links, but I can't remove them from the scraped data.

You can use file_get_contents with str_get_html and strip the unwanted paragraphs from the content:
include 'simple_html_dom.php';
$content=file_get_contents($url);
$html = str_get_html($content);
// i read the first div
foreach($html->find('#content') as $element){
// i read the second
foreach ($element->find('p[class="text-right"]') as $phone){
$content=str_replace($phone,'',$content);
}
}
print $content;
die;

Or here is a native version:
PHP-CODE
$sHtml = '<p>the info that i really need is here<p>
<p class="text-right"><a class="btn btn-default espbott aplus" role="button"
href="brand/google.html">Google</a></p>';
$sHtml = '<div id="wrapper">' . $sHtml . '</div>';
echo "org:\n";
echo $sHtml;
echo "\n\n";
$doc = new DOMDocument();
$doc->loadHtml($sHtml);
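// getElementsByTagName() returns a live list, so copy it before removing nodes
foreach( iterator_to_array( $doc->getElementsByTagName( 'a' ) ) as $element ) {
$element->parentNode->removeChild( $element );
}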
echo "res:\n";
echo $doc->saveHTML($doc->getElementById('wrapper'));
Output
org:
<div id="wrapper"><p>the info that i really need is here<p>
<p class="text-right"><a class="btn btn-default espbott aplus" role="button"
href="brand/google.html">Google</a></p></div>
res:
<div id="wrapper">
<p>the info that i really need is here</p>
<p>
</p>
<p class="text-right"></p>
</div>
https://3v4l.org/RhuEU
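If the empty <p class="text-right"> left behind is also unwanted, the whole paragraph can be dropped with DOMXPath instead. This is a sketch under assumptions: the function name stripLinkParagraphs() and the wrapper id are made up for the example, and the class is matched exactly as in the question's markup.

```php
<?php
// Hypothetical helper; stripLinkParagraphs() is not from the answers above.
function stripLinkParagraphs(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML('<div id="wrapper">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // The XPath result is a snapshot, so nodes can be removed while looping over it.
    foreach ($xpath->query('//p[@class="text-right"]') as $p) {
        $p->parentNode->removeChild($p);
    }

    // Serialize only what is inside the wrapper.
    $out = '';
    foreach ($doc->getElementById('wrapper')->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling strip_tags() on the result would then leave just the text, if that is all that is needed.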

How to get <a href= value inside div with class name only?

I am trying to get the href values inside a div with a given class name (class="visible-xs"). I tried this code, but it also returns every href outside the div, which I don't want.
$dom = new DOMDocument;
$dom->loadHTML($code2);
foreach ($dom->getElementsByTagName('a') as $node)
{
echo $node->getAttribute("href")."\n";
}
Then I tried the following, but it gives me an error (Fatal error: Call to undefined method DOMDocument::getElementsByClassName() in..):
$dom = new DOMDocument;
$dom->loadHTML($code2);
foreach ($dom->getElementsByClassName('visible-xs') as $bigDiv) {
echo $bigDiv->getAttribute("href")."\n";
}
Could anyone help me fix the above error and get only the href values inside the div with class name visible-xs? Thanks in advance.
sample data:
<tr class="ng-scope" ng-repeat="item in itemContent">
<td class="ng-binding" style="word-wrap: break-word;">test/folder/1.mp4
<div class="visible-xs" style="padding-top: 10px;">
<!-- ngIf: item.isViewable --> class="btn btn-default ng-scope" ng-click="$root.openView(item);">View</a><!-- end ngIf: item.isViewable -->
Download
<a class="btn btn-default" href="javascript:void(0);" ng-click="item.upload()" target="_blank">Upload</a>
</div>
</td>
<!-- ngIf: hasViewables --><td class="text-right hidden-xs ng-scope" style="white-space: nowrap; width: 60px;" ng-if="hasViewables">
<!-- ngIf: item.isViewable -->class="btn btn-default ng-scope" ng-click="$root.openView(item);">View</a><!-- end ngIf: item.isViewable -->
</td><!-- end ngIf: hasViewables -->
<td class="text-right hidden-xs" style="white-space: nowrap; width: 250px;">
Download
<a class="btn btn-default" href="javascript:void(0);" ng-click="item.upload()" target="_blank">Upload</a>
</td>
</tr>

DOMDocument has no getElementsByClassName method. Iterate over your divs, check the class, and if it matches, pull the links inside and output the hrefs you want (or break if you want to stop after the first match).
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
foreach ($dom->getElementsByTagName('div') as $div) {
if($div->getAttribute('class') == 'visible-xs') {
foreach($div->getElementsByTagName('a') as $link) {
echo $link->getAttribute('href');
}
}
}
Demo: https://eval.in/698484
Example with the break, https://eval.in/698488.
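The same filter can also be written as a single XPath query, which keeps working when the div carries more than one class (the == comparison above only matches when the attribute is exactly "visible-xs"). A minimal sketch follows; the function name hrefsInClass() is an assumption, and $class is expected to be a plain, trusted class name because it is pasted into the query.

```php
<?php
// Hypothetical helper; hrefsInClass() is not from the original answer.
function hrefsInClass(string $html, string $class): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // Pad the class attribute with spaces so only whole class tokens match.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]//a[@href]',
        $class
    );

    $hrefs = [];
    foreach ($xpath->query($query) as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}
```

Usage would be along the lines of print_r(hrefsInClass($code2, 'visible-xs'));.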

Getting the title of post

I am trying to get the title of a post using simple_html_dom. The HTML structure is shown below; the part I am trying to get is the link text, This is our title.
<div id="content">
<div id="section">
<div id="sectionleft">
<p>
Latest News
</p>
<ul class="cont news">
<li>
<div style="padding: 1px;">
<a href="http://www.example.com">
<img src="http://www.example.com/our-image.png" width="128" height="96" alt="">
</a>
</div>
<a href="http://www.example.com" class="name">
This is our title
</a>
<i class="info">added: Dec 16, 2015</i>
</li>
</ul>
</div>
</div>
</div>
Currently I have this
$page = (isset($_GET['p'])&&$_GET['p']!=0) ? (int) $_GET['p'] : '';
$html = file_get_html('http://www.example.com/'.$page);
foreach($html->find('div#section ul.cont li div a') as $element)
{
print '<br><br>';
echo $url = 'http://www.example.com/'.$element->href;
$html2 = file_get_html($url);
print '<br>';
$image = $html2->find('meta[property=og:image]',0);
print $image = $image->content;
print '<br>';
$title = $html2->find('#sectionleft ul.cont news li a.name',0);
print $title = $title->plaintext;
print '<br>';
}
The issue is here: $title = $html2->find('#sectionleft ul.cont news li a.name',0); I assume I am using the wrong selector, but I am not sure what I am doing wrong.

ul.cont news means "find <news> elements that are descendants of ul.cont".
You actually want:
#sectionleft ul.cont.news li a.name
EDIT: For some reason, it seems simple_html_dom doesn't like ul.cont.news even though it's a valid CSS selector.
You can try
#sectionleft ul[class="cont news"] li a.name
which should work as long as the classes are in that order.
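If the class order cannot be relied on, one alternative is PHP's built-in DOMXPath, which sidesteps the simple_html_dom selector parser entirely. This is a sketch, not a drop-in replacement: hasClass() and firstPostTitle() are hypothetical names, and the page is assumed to be fetched into a string first (for example with file_get_contents).

```php
<?php
// Hypothetical helpers; hasClass() and firstPostTitle() are not from the answers above.
function hasClass(string $class): string
{
    // XPath 1.0 test for "the class attribute contains this whole token".
    return sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $class);
}

function firstPostTitle(string $html): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Rough equivalent of the CSS selector: #sectionleft ul.cont.news li a.name
    $query = '//*[@id="sectionleft"]//ul[' . hasClass('cont') . ' and ' . hasClass('news') . ']'
           . '//li//a[' . hasClass('name') . ']';

    $node = $xpath->query($query)->item(0);
    return $node ? trim($node->textContent) : null;   // null when nothing matches
}
```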

If this seems a little hacky, forgive me, but... you can always employ PHP to run a quick .js:
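<?php
echo '<script>';
echo 'var postTitle = document.querySelector("ul.cont.news a.name").innerHTML;';
if (!isset($_GET['posttitle'])) {
// URL-encode the title before appending it to the query string
echo 'window.location.href = window.location.href + "?posttitle=" + encodeURIComponent(postTitle);';
}
echo '</script>';
// empty on the first request, before the redirect has happened
$postTitle = $_GET['posttitle'] ?? '';
?>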

Cleaning HTML by removing extra/redundant formatting tags

I have been using CKEditor wysiwyg editor for a website where users are allowed to use the HTML editor to add some comments. I ended up having some extremely redundant nested HTML code in my database that is slowing down the viewing/editing of these comments.
I have comments that look like this (this is a very small example. I have comments with over 100 nested tags):
<p>
<strong>
<span style="font-size: 14px">
<span style="color: #006400">
<span style="font-size: 14px">
<span style="font-size: 16px">
<span style="color: #006400">
<span style="font-size: 14px">
<span style="font-size: 16px">
<span style="color: #006400">This is a </span>
</span>
</span>
</span>
</span>
</span>
</span>
<span style="color: #006400">
<span style="font-size: 16px">
<span style="color: #b22222">Test</span>
</span>
</span>
</span>
</span>
</strong>
</p>
My questions are:
Is there any library/code/software that can do a smart (i.e. format-aware) clean-up of the HTML code, removing all redundant tags that have no effect on the formatting (because they're overridden by inner tags) ? I've tried many existing online solutions (such as HTML Tidy). None of them do what I want.
If not, I'll need to write some code for HTML parsing and cleaning. I am planning to use PHP Simple HTML DOM to traverse the HTML tree and find all tags that have no effect. Do you suggest any other HTML parser that is more suitable for my purpose?
Thanks
Update:
I have written some code to analyze the HTML code that I have. All the HTML tags that I have are:
<span> with styles for font-size and/or color
<font> with attributes color and/or size
<a> for links (with href)
<strong>
<p> (single tag to wrap the whole comment)
<u>
I can easily write some code to convert the HTML code into bbcode (e.g. [b], [color=blue], [size=3], etc). So the above HTML will become something like:
[b][size=14][color=#006400][size=14][size=16][color=#006400]
[size=14][size=16][color=#006400]This is a [/color][/size]
[/size][/color][/size][/size][color=#006400][size=16]
[color=#b22222]Test[/color][/size][/color][/color][/size][/b]
The question now is: Is there an easy way (algorithm/library/etc) to clean-up the messy (as messy as that original HTML) bbcode that will be generated?
thanks again

Introduction
The best solution I have seen so far is HTML Tidy: http://tidy.sourceforge.net/
Beyond converting the format of a document, Tidy is also able to convert deprecated HTML tags into their cascading style sheet (CSS) counterparts automatically through the use of the clean option. The generated output contains an inline style declaration.
It also ensures that the HTML document is XHTML-compatible.
Example
$code ='<p>
<strong>
<span style="font-size: 14px">
<span style="color: #006400">
<span style="font-size: 14px">
<span style="font-size: 16px">
<span style="color: #006400">
<span style="font-size: 14px">
<span style="font-size: 16px">
<span style="color: #006400">This is a </span>
</span>
</span>
</span>
</span>
</span>
</span>
<span style="color: #006400">
<span style="font-size: 16px">
<span style="color: #b22222">Test</span>
</span>
</span>
</span>
</span>
</strong>
</p>';
If you RUN
$clean = cleaning($code);
print($clean['body']);
Output
<p>
<strong>
<span class="c3">
<span class="c1">This is a</span>
<span class="c2">Test</span>
</span>
</strong>
</p>
You can get the CSS
$clean = cleaning($code);
print($clean['style']);
Output
<style type="text/css">
span.c3 {
font-size: 14px
}
span.c2 {
color: #006400;
font-size: 16px
}
span.c1 {
color: #006400;
font-size: 14px
}
</style>
Or the full HTML
$clean = cleaning($code);
print($clean['full']);
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<style type="text/css">
/*<![CDATA[*/
span.c3 {font-size: 14px}
span.c2 {color: #006400; font-size: 16px}
span.c1 {color: #006400; font-size: 14px}
/*]]>*/
</style>
</head>
<body>
<p>
<strong><span class="c3"><span class="c1">This is a</span>
<span class="c2">Test</span></span></strong>
</p>
</body>
</html>
Function Used
function cleaning($string, $tidyConfig = null) {
$out = array ();
$config = array (
'indent' => true,
'show-body-only' => false,
'clean' => true,
'output-xhtml' => true,
'preserve-entities' => true
);
if ($tidyConfig == null) {
$tidyConfig = &$config;
}
$tidy = new tidy ();
$out ['full'] = $tidy->repairString ( $string, $tidyConfig, 'UTF8' );
unset ( $tidy );
unset ( $tidyConfig );
$out ['body'] = preg_replace ( "/.*<body[^>]*>|<\/body>.*/si", "", $out ['full'] );
$out ['style'] = '<style type="text/css">' . preg_replace ( "/.*<style[^>]*>|<\/style>.*/si", "", $out ['full'] ) . '</style>';
return ($out);
}
================================================
Edit 1 : Dirty Hack (Not Recommended)
================================================
Based on your last comment, it seems you want to retain the deprecated inline styles. HTML Tidy might not allow that, since they are deprecated, but you can do this:
$out = cleaning ( $code );
$getStyle = new css2string ();
$getStyle->parseStr ( $out ['style'] );
$body = $out ['body'];
$search = array ();
$replace = array ();
foreach ( $getStyle->css as $key => $value ) {
list ( $selector, $name ) = explode ( ".", $key );
$search [] = "<$selector class=\"$name\">";
$style = array ();
foreach ( $value as $type => $att ) {
$style [] = "$type:$att";
}
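$replace [] = "<$selector style=\"" . implode ( ";", $style ) . ";\">";
}
// swap each generated class attribute for its inline style
echo str_replace ( $search, $replace, $body );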
Output
<p>
<strong>
<span style="font-size:14px;">
<span style="color:#006400;font-size:14px;">This is a</span>
<span style="color:#006400;font-size:16px;">Test</span>
</span>
</strong>
</p>
Class Used
//Credit : http://stackoverflow.com/a/8511837/1226894
class css2string {
var $css;
function parseStr($string) {
preg_match_all ( '/(?ims)([a-z0-9, \s\.\:#_\-@]+)\{([^\}]*)\}/', $string, $arr );
$this->css = array ();
foreach ( $arr [0] as $i => $x ) {
$selector = trim ( $arr [1] [$i] );
$rules = explode ( ';', trim ( $arr [2] [$i] ) );
$this->css [$selector] = array ();
foreach ( $rules as $strRule ) {
if (! empty ( $strRule )) {
$rule = explode ( ":", $strRule );
$this->css [$selector] [trim ( $rule [0] )] = trim ( $rule [1] );
}
}
}
}
function arrayImplode($glue, $separator, $array) {
if (! is_array ( $array ))
return $array;
$styleString = array ();
foreach ( $array as $key => $val ) {
if (is_array ( $val ))
$val = implode ( ',', $val );
$styleString [] = "{$key}{$glue}{$val}";
}
return implode ( $separator, $styleString );
}
function getSelector($selectorName) {
return $this->arrayImplode ( ":", ";", $this->css [$selectorName] );
}
}

Here is a solution that uses the browser to get the nested element's properties. There is no need to cascade the properties up, since the browser's computed styles are ready to read.
Here is an example: http://jsfiddle.net/mmeah/fUpe8/3/
var fixedCode = readNestProp($("#redo"));
$("#simp").html( fixedCode );
function readNestProp(el){
var output = "";
$(el).children().each( function(){
if($(this).children().length==0){
var _that=this;
var _cssAttributeNames = ["font-size","color"];
var _tag = $(_that).prop("nodeName").toLowerCase();
var _text = $(_that).text();
var _style = "";
$.each(_cssAttributeNames, function(_index,_value){
var css_value = $(_that).css(_value);
if(typeof css_value!= "undefined"){
_style += _value + ":";
_style += css_value + ";";
}
});
output += "<"+_tag+" style='"+_style+"'>"+_text+"</"+_tag+">";
}else if(
$(this).prop("nodeName").toLowerCase() !=
$(this).find(">:first-child").prop("nodeName").toLowerCase()
){
var _tag = $(this).prop("nodeName").toLowerCase();
output += "<"+_tag+">" + readNestProp(this) + "</"+_tag+">";
}else{
output += readNestProp(this);
};
});
return output;
}
A better solution to typing in all possible css attributes like:
var _cssAttributeNames = ["font-size","color"];
Is to use a solution like mentioned here:
Can jQuery get all CSS styles associated with an element?

You should look into HTMLPurifier, it's a great tool for parsing HTML and removing unnecessary and unsafe content from it. Look into the removing empty spans configs and stuff. It can be a bit of a beast to configure I admit, but that's only because it's so versatile.
It's also quite heavy, so you'd want to save its output to the database (as opposed to reading the raw HTML from the database and then parsing it with the purifier every time).

I don't have time to finish this... maybe someone else can help. This javascript removes exact duplicate tags and disallowed tags too...
There are a few problems/things to be done,
1) regenerated tags need to be closed
2) it will only remove a tag if the tag-name & attributes are identical to another within that node's children, so it's not 'smart' enough to remove all unnecessary tags.
3) it will look through the allowed CSS variables and extract ALL those values from an element, and then write it to the output HTML, so for example:
var allowed_css = ["color","font-size"];
<span style="font-size: 12px"><span style="color: #123123">
Will be translated into:
<span style="color:#000000;font-size:12px;"> <!-- inherited colour from parent -->
<span style="color:#123123;font-size:12px;"> <!-- inherited font-size from parent -->
Code:
<html>
<head>
<script type="text/javascript">
var allowed_css = ["font-size", "color"];
var allowed_tags = ["p","strong","span","br","b"];
function initialise() {
var comment = document.getElementById("comment");
var commentHTML = document.getElementById("commentHTML");
var output = document.getElementById("output");
var outputHTML = document.getElementById("outputHTML");
print(commentHTML, comment.innerHTML, false);
var out = getNodes(comment);
print(output, out, true);
print(outputHTML, out, false);
}
function print(out, stringCode, allowHTML) {
out.innerHTML = allowHTML? stringCode : getHTMLCode(stringCode);
}
function getHTMLCode(stringCode) {
return "<code>"+((stringCode).replace(/</g,"&lt;")).replace(/>/g,"&gt;")+"</code>";
}
function getNodes(elem) {
var output = "";
var nodesArr = new Array(elem.childNodes.length);
for (var i=0; i<nodesArr.length; i++) {
nodesArr[i] = new Array();
nodesArr[i].push(elem.childNodes[i]);
getChildNodes(elem.childNodes[i], nodesArr[i]);
nodesArr[i] = removeDuplicates(nodesArr[i]);
output += nodesArr[i].join("");
}
return output;
}
function removeDuplicates(arrayName) {
var newArray = new Array();
label:
for (var i=0; i<arrayName.length; i++) {
for (var j=0; j<newArray.length; j++) {
if(newArray[j]==arrayName[i])
continue label;
}
newArray[newArray.length] = arrayName[i];
}
return newArray;
}
function getChildNodes(elemParent, nodesArr) {
var children = elemParent.childNodes;
for (var i=0; i<children.length; i++) {
nodesArr.push(children[i]);
if (children[i].hasChildNodes())
getChildNodes(children[i], nodesArr);
}
return cleanHTML(nodesArr);
}
function cleanHTML(arr) {
for (var i=0; i<arr.length; i++) {
var elem = arr[i];
if (elem.nodeType == 1) {
if (tagNotAllowed(elem.nodeName)) {
arr.splice(i,1);
i--;
continue;
}
elem = "<"+elem.nodeName+ getAttributes(elem) +">";
}
else if (elem.nodeType == 3) {
elem = elem.nodeValue;
}
arr[i] = elem;
}
return arr;
}
function tagNotAllowed(tagName) {
var allowed = " "+allowed_tags.join(" ").toUpperCase()+" ";
if (allowed.search(" "+tagName.toUpperCase()+" ") == -1)
return true;
else
return false;
}
function getAttributes(elem) {
var attributes = "";
for (var i=0; i<elem.attributes.length; i++) {
var attrib = elem.attributes[i];
if (attrib.specified == true) {
if (attrib.name == "style") {
attributes += " style=\""+getCSS(elem)+"\"";
} else {
attributes += " "+attrib.name+"=\""+attrib.value+"\"";
}
}
}
return attributes
}
function getCSS(elem) {
var style="";
if (elem.currentStyle) {
for (var i=0; i<allowed_css.length; i++) {
var styleProp = allowed_css[i];
style += styleProp+":"+elem.currentStyle[styleProp]+";";
}
} else if (window.getComputedStyle) {
for (var i=0; i<allowed_css.length; i++) {
var styleProp = allowed_css[i];
style += styleProp+":"+document.defaultView.getComputedStyle(elem,null).getPropertyValue(styleProp)+";";
}
}
return style;
}
</script>
</head>
<body onload="initialise()">
<div style="float: left; width: 300px;">
<h2>Input</h2>
<div id="comment">
<p>
<strong>
<span style="font-size: 14px">
<span style="color: #006400">
<span style="font-size: 14px">
<span style="font-size: 16px">
<span style="color: #006400">
<span style="font-size: 14px">
<span style="font-size: 16px">
<span style="color: #006400">This is a </span>
</span>
</span>
</span>
</span>
</span>
</span>
<span style="color: #006400">
<span style="font-size: 16px">
<span style="color: #b22222"><b>Test</b></span>
</span>
</span>
</span>
</span>
</strong>
</p>
<p>Second paragraph.
<span style="color: #006400">This is a span</span></p>
</div>
<h3>HTML code:</h3>
<div id="commentHTML"> </div>
</div>
<div style="float: left; width: 300px;">
<h2>Output</h2>
<div id="output"> </div>
<h3>HTML code:</h3>
<div id="outputHTML"> </div>
</div>
<div style="float: left; width: 300px;">
<h2>Tasks</h2>
<big>
<ul>
<li>Close Tags</li>
<li>Ignore inherited CSS style in method getCSS(elem)</li>
<li>Test with different input HTML</li>
</ul>
</big>
</div>
</body>
</html>

It may not address your exact problem, but what I would have done in your place is simply eliminate all HTML tags completely and retain only plain text and line breaks.
After that is done, switch to Markdown or BBCode to format your comments better. A WYSIWYG editor is rarely useful.
The reason for that is that you said all you have in the comments is presentational data, which frankly isn't that important.

Cleanup HTML collapses tags which seems to be what you are asking for. However, it creates a validated HTML document with CSS moved to inline styles. Many other HTML formatters won't do this because it changes the structure of the HTML document.

I remember that Adobe (Macromedia) Dreamweaver, at least slightly old versions had an option, 'Clean up HTML', and also a 'Clean up word html' to remove redundant tags etc from any webpage.

I know you're looking for an HTML DOM cleanser, but maybe js can help?
function getSpans(){
var spans=document.getElementsByTagName('span')
for (var i=0;i<spans.length;i++){
spans[i].removeNode(true);
if(i == spans.length) {
//add the styling you want here
}
}
}

Rather than waste your precious server time parsing bad HTML I would suggest you fix the root of the problem instead.
A simple solution would be to apply the character limit for each comment to the entire HTML, as opposed to just the text (at least that would stop infinitely large nested tags).
You could improve on that by allowing the user to switch between HTML view and text view. I'm sure most people would see a load of junk in the HTML view and simply Ctrl+A and Delete it.
I think it would be best if you had your own formatting characters that you parse and replace with the formatting, like Stack Overflow's **bold text**, visible to the poster. Or just BBCode would do, also visible to the poster.

Try parsing the HTML with SAX rather than DOM (http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm).
SAX parses a document from the beginning and fires events such as 'start of element' and 'end of element', which invoke the callback functions you define.
Then you can build a kind of stack from those events. When you reach text, you can save the effect of the current stack on that text.
After that, you process the saved results to build new HTML with only the effects you want.
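A rough sketch of that idea with PHP's XML parser extension (ext-xml) is below. It makes several assumptions: the comment is well-formed XML, only inline style attributes matter, and every text run gets a single span carrying its effective styles while other tags such as p and strong are dropped. The function name flattenSpans() is made up for the example.

```php
<?php
// Hypothetical helper; flattenSpans() illustrates the stack idea and nothing more.
function flattenSpans(string $xml): string
{
    $stack = [[]];   // effective style map for each currently open element
    $out   = '';

    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);   // keep names lowercase

    xml_set_element_handler(
        $parser,
        // Start of element: inherit the parent's styles, then apply this element's own.
        function ($p, string $name, array $attrs) use (&$stack) {
            $style = end($stack);
            foreach (explode(';', $attrs['style'] ?? '') as $decl) {
                if (strpos($decl, ':') !== false) {
                    [$prop, $value] = array_map('trim', explode(':', $decl, 2));
                    $style[$prop] = $value;   // inner declarations override outer ones
                }
            }
            $stack[] = $style;
        },
        // End of element: its styles no longer apply.
        function ($p, string $name) use (&$stack) {
            array_pop($stack);
        }
    );

    xml_set_character_data_handler(
        $parser,
        function ($p, string $text) use (&$stack, &$out) {
            if (trim($text) === '') {
                return;   // skip whitespace between tags
            }
            $style = end($stack);
            ksort($style);   // stable property order
            $css = [];
            foreach ($style as $prop => $value) {
                $css[] = "$prop: $value";
            }
            $out .= '<span style="' . implode('; ', $css) . '">' . htmlspecialchars($text) . '</span>';
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $out;
}
```

The parser may deliver one text run in several pieces, so adjacent spans with identical styles could still be merged in a second pass.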

If you want to use jQuery, try this:
<p>
<strong>
<span style="font-size: 14px">
<span style="color: #006400">
<span style="font-size: 14px">
<span style="font-size: 16px">
<span style="color: #006400">
<span style="font-size: 14px">
<span style="font-size: 16px">
<span style="color: #006400">This is a </span>
</span>
</span>
</span>
</span>
</span>
</span>
<span style="color: #006400">
<span style="font-size: 16px">
<span style="color: #b22222">Test</span>
</span>
</span>
</span>
</span>
</strong>
</p>
<br><br>
<div id="out"></div> <!-- Just to print it out -->
$("span").each(function(i){
var ntext = $(this).text();
ntext = $.trim(ntext.replace(/(\r\n|\n|\r)/gm," "));
if(i==0){
$("#out").text(ntext);
}
});
You get this as a result:
<div id="out">This is a Test</div>
You could then format it any way you want. Hope that helps you think a little differently about it...
