php preg_match table and wrapping div - php

I have CMS-driven content, and when saving I prep the content. As part of that, I want to clean up the tables the authors create.
We use Bootstrap on the front end, so I want to first grab all tables,
then check each table's parent element and, if it is not <div class="table-responsive">, wrap the table in one.
I have:
// $content = $_POST['content'];
// Set some TEST content
$content = "<h1>My Content</h1>
<p>This is some content</p>
<table border=\"1\">
<tr>
<td>cell</td>
<td>cell</td>
<td>cell</td>
</tr>
<tr>
<td>cell</td>
<td>cell</td>
<td>cell</td>
</tr>
</table>
<div align=\"center\">see the above content</div>
<p>Thanks!</p>\n\n";
// Make our example content longer with more variations...
$content = $content .
str_replace('<table border="1">', '<table border="0" class="my-table">', $content) .
str_replace('<table border="1">', '<table border="0" cellpadding="0" cellspacing="3">', $content);
$output = $content;
// Parse for table tags
preg_match_all("/<table(.*?)>/", $content, $tables);
// If we have table tags..
if(count($tables[1]) > 0) {
// loop over and get the info we want to build the new table tag.
foreach($tables[0] as $key => $match) {
$add_class = array();
$tag = ' '. $tables[1][$key] .' ';
$add_class[] = 'table';
// check if we have got borders...
// If we do, add the Bootstrap table-bordered class.
if(strpos($tag, 'border="0"') === FALSE) {
$add_class[] = 'table-bordered';
}
// prepend any existing/custom classes.
if(strpos($tag, 'class="') > 0) {
preg_match("/class=\"(.*?)\"/", $tag, $classes);
if($classes[1]) {
$add_class = array_merge($add_class, explode(' ', $classes[1]));
}
}
// add classes.
$add_class = array_unique($add_class);
// Now - replace the original <table> tag with the new BS tag.
// adding any class attrs
// wrap in the responsive DIV. - THIS part - needs to be only added if its not already wrapped...
// this would happen if we have already edited the page before right ...
$output = str_replace($match, '<div class="table-responsive">'."\n".'<table class="'. implode(' ', $add_class) .'">', $output);
}
// replace all closing </table> tags with the closing responsive tag too...
$output = str_replace('</table>', '</table>'."\n".'</div>', $output);
}
echo highlight_string($content, TRUE);
echo '<hr>';
echo highlight_string($output, TRUE);

You can use Simple HTML DOM Parser to select the divs:
https://github.com/sunra/php-simple-html-dom-parser
$html = new simple_html_dom();
// load the HTML from the string you already have
$html->load($content);
// find() returns the elements matching the selector
$elements = $html->find("#youdiv");
Good luck
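If you would rather stay with PHP's built-in DOM extension, the "wrap only if it is not already wrapped" check could be sketched roughly like this. wrapTablesResponsive is a made-up name, and the sketch assumes the content is an HTML fragment that DOMDocument::loadHTML can parse (non-ASCII text may need extra encoding handling):

```php
<?php
// Hypothetical helper (not from the original post): wraps every <table> in
// <div class="table-responsive"> unless its parent already has that class.
function wrapTablesResponsive(string $content): string
{
    if (trim($content) === '') {
        return $content;
    }
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // CMS markup is rarely perfect
    $dom->loadHTML($content);
    libxml_clear_errors();

    // Copy the live node list first, because the tree changes while we loop.
    $tables = iterator_to_array($dom->getElementsByTagName('table'));
    foreach ($tables as $table) {
        $parent = $table->parentNode;
        $classes = $parent instanceof DOMElement
            ? preg_split('/\s+/', trim($parent->getAttribute('class')))
            : [];
        if (in_array('table-responsive', $classes, true)) {
            continue;                   // already wrapped on an earlier save
        }
        $div = $dom->createElement('div');
        $div->setAttribute('class', 'table-responsive');
        $parent->replaceChild($div, $table);   // put the div where the table was
        $div->appendChild($table);             // then move the table inside it
    }

    // loadHTML() adds <html><body>; return only what is inside <body>.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $content;
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Because the parent is checked on every run, saving the same page twice should not add a second wrapper. The Bootstrap classes on the table itself could be set in the same loop with setAttribute().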

Related

Split long html into smaller pieces

I need to split long html into smaller pieces with respect to tags and inline styles.
E.g. given html
<table>
<tr>
<td style="font-size:12px;">Some long string here</td>
<td style="color: red">Some short string here</td>
<td style="font-weight: bold">Some specific string here</td>
</tr>
</table>
The source of the problem: I've a long html over 50k of chars and I need to translate it via google translate api which has a max limit of 5000 chars per request.
For just splitting it, you can use str_split and iterate the array:
$chunks = str_split($html, 5000);
foreach ($chunks as $chunk) {
// $chunk var is up to 5000 chars long
}
But first, you might want to consider sending only the text, without the HTML markup:
$dom = new DOMDocument();
$dom->loadHTML($html);
// textContent of the root element already includes the text of every descendant,
// so looping over all elements would repeat nested text several times
$text = $dom->documentElement->textContent;
Then apply the split operation to the $text variable.
For both combined, let's say your input variable is $html:
function htmlToText($html)
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    // The root element's textContent covers the text of all nested elements
    return $dom->documentElement->textContent;
}
$chunks = str_split(htmlToText($html), 5000);
foreach ($chunks as $chunk) {
// dispatch the API operation over $chunk
}
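Note that str_split cuts at fixed byte offsets, so it can split a word (or a multi-byte character) in half. A minimal sketch of a splitter that prefers to break at a space is below; chunkText is a hypothetical name, and it assumes the mbstring extension is available:

```php
<?php
// Hypothetical helper: splits $text into chunks of at most $limit characters,
// breaking at the last space inside each window when there is one.
function chunkText(string $text, int $limit = 5000): array
{
    $chunks = [];
    while (mb_strlen($text) > $limit) {
        $head = mb_substr($text, 0, $limit);
        $cut = mb_strrpos($head, ' ');
        if ($cut === false || $cut === 0) {
            $cut = $limit;              // no usable space: fall back to a hard split
        }
        $chunks[] = mb_substr($text, 0, $cut);
        $text = mb_substr($text, $cut); // the space stays at the start of the next chunk
    }
    if ($text !== '') {
        $chunks[] = $text;
    }
    return $chunks;
}
```

Joining the chunks with implode('', $chunks) gives back the original text, so nothing is lost between requests.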

html dom parser to extract href from span sibling

Here is my HTML file; it contains a date and a link in <span> tags within a table.
Can anyone help me find the link for a particular date?
<table>
<tbody>
<tr class="c0">
<td class="c11">
<td class="c8">
<ul class="c2 lst-kix_h6z8amo254ry-0 start">
<li class="c1">
<span>1st Apr 2014 - </span>
<span class="c6"><a class="c4" href="/link.html">View</a>
</span>
</li>
</ul>
</td>
</tr>
</td>
</table>
I want to retrieve the link for a particular date.
My code looks like this:
include('simple_html_dom.php');
$html = file_get_html('link.html');
//store the links in array
foreach($html->find('span') as $value)
{
//echo $value->plaintext . '<br />';
$date = $value->plaintext;
if (strpos($date,$compare_text)) {
//$linkeachday = $value->find('span[class=c1]')->href;
//$day_url[] = $value->href;
//$day_url = Array("text" => $value->plaintext);
$day_url = Array("text" => $date, "link" =>$linkeachday);
//echo $value->next_sibling (a);
}
}
or
$spans = $html->find('table',0)->find('li')->find('span');
echo $spans;
$num = null;
foreach($spans as $span){
if($span->plaintext == $compare_text){
$next_span = $span->next_sibling();
$num = $next_span->plaintext;
echo($num);
break;
}
}
echo($num);
You were on the right path with your last example...
I modified it a bit to get the following which basically gets all spans, then test if they have the searched text, and if so, it displays the content of their next sibling if there is any (check the in code comments):
$input = <<<_DATA_
<table>
<tbody>
<tr class="c0">
<td class="c11">
<td class="c8">
<ul class="c2 lst-kix_h6z8amo254ry-0 start">
<li class="c1">
<span>1st Apr 2013 - </span>
<span>1st Apr 2014 - </span>
<span class="c6">
<a class="c4" href="/link.html">View</a>
</span>
<span>1st Apr 2015 - </span>
</li>
</ul>
</td>
</td>
</tr>
</tbody>
</table>
_DATA_;
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($input);
// Searched value
$searchDate = '1st Apr 2014';
// Find all the spans that are direct children of an li, which is a descendant of table
$spans = $html->find('table li > span');
// Loop through all the spans
foreach ($spans as $span) {
// If the span starts with the searched text && has a following sibling
if ( strpos($span->plaintext, $searchDate) === 0 && $sibling = $span->next_sibling()) {
// Then, print its text content
echo $sibling->plaintext; // or ->innertext for raw content
// And stop (if only one result is needed)
break;
}
}
OUTPUT
View
For the string comparison, you may also (and preferably) use a regex.
In the code above, add this to build your pattern:
$pattern = sprintf('~^\s*%s~i', preg_quote($searchDate, '~'));
And then use preg_match to test the match:
if ( preg_match($pattern, $span->plaintext) && $sibling = $span->next_sibling()) {
I don't know about simple HTML DOM but the built in PHP DOM library should suffice.
Say you have your date in a string like this...
$date = '1st Apr 2014';
You can easily find the corresponding link using an XPath expression. For example
$doc = new DOMDocument();
$doc->loadHTMLFile('link.html');
$xp = new DOMXpath($doc);
$query = sprintf('//span[starts-with(., "%s")]/following-sibling::span/a', $date);
$links = $xp->query($query);
if ($links->length) {
$href = $links->item(0)->getAttribute('href');
}
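A self-contained variant of the same XPath idea, as a sketch only: findLinkForDate is a made-up helper name, it takes the HTML as a string instead of a file, and it assumes the date contains no double quote (it is pasted into the XPath expression as-is).

```php
<?php
// Hypothetical helper: returns the href of the link in the span that directly
// follows the span starting with $date, or null when nothing matches.
function findLinkForDate(string $html, string $date): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($doc);
    // string(...) makes evaluate() return the first matching href as a PHP string.
    $query = sprintf(
        'string(//span[starts-with(normalize-space(.), "%s")]/following-sibling::span[1]/a/@href)',
        $date
    );
    $href = $xp->evaluate($query);

    return $href === '' ? null : $href;
}
```

The [1] after following-sibling::span restricts the match to the span immediately after the date, which matters when one <li> holds several dates.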
include('simple_html_dom.php');
$html = file_get_html('link.html');
$compare_text = "1st Apr 2013";
$tds = $html->find('table',1)->find('span');
$num = 0;
foreach($tds as $td){
if (strpos($td->plaintext, $compare_text) !== false){
$next_td = $td->next_sibling();
foreach($next_td->find('a') as $elm) {
$num = $elm->href;
}
//$day_url = array($day => array(daylink => $day, text => $td->plaintext, link => $num));
echo $td->plaintext. "<br />";
echo $num . "<br />";
}
}

cleaning contents inside of html tags

I am trying to write a preg_replace that will clean all tag properties of the allowed tags, and all tags that do not exist in the allowed list.
Basic example- this:
<p style="some styling here">Test<div class="button">Button Text</div></p>
would turn out to be:
<p>test</p>
I have this working well.. Except for img tags and a href tags. I need to not clean the properties of the img and a tags. Possibly others. I was not sure if there was a way to set two allow lists?
1) One list for what tags are allowed to stay after being cleaned
2) One list for the tags that are allowed but left alone?
3) The rest are deleted.
Here is the script I am working on:
$string = '<p style="width: 250px;">This is some text<div class="button">This is the button</div><br><img src="waves.jpg" width="150" height="200" /></p><p><b>Title</b><br>Here is some more text and this is a link</p>';
$output = strip_tags($string, '<p><b><br><img><a>');
$output = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i", '<$1$2$3$4$5>', $output);
echo $output;
This script should clean the $string to be:
<p>This is some text<br><img src="waves.jpg" width="150" height="200" /></p><p><b>Title</b><br>Here is some more text and this is a link</p>
http://ideone.com/aoOOUN
This function will strip an element of disallowed sub elements, clean its "stripped" sub elements, and leave the rest (recursively).
function clean($element, $allowed, $stripped){
if(!is_array($allowed) || ! is_array($stripped)) return;
if(!$element)return;
$toDelete = array();
foreach($element->childNodes as $child){
if(!isset($child->tagName))continue;
$n = $child->tagName;
if ($n && !in_array($n, $allowed) && !in_array($n, $stripped)){
$toDelete[] = $child;
continue;
}
if($n && in_array($n, $stripped)){
$attr = array();
foreach($child->attributes as $a)
$attr[] = $a->nodeName;
foreach($attr as $a)
$child->removeAttribute($a);
}
clean($child, $allowed, $stripped);
}
foreach ($toDelete as $del)
$element->removeChild($del);
}
This is the code to clean your string:
$xhtml = '<p style="width: 250px;">This is some text<div class="button">This is the button</div><br><img src="waves.jpg" width="150" height="200" /></p><p><b>Title</b><br>Here is some more text and this is a link</p>';
$dom = new DOMDocument();
$dom->loadHTML($xhtml);
$body = $dom->getElementsByTagName('body')->item(0);
clean($body, array('img', 'a'), array('p', 'br', 'b'));
echo preg_replace('#^.*?<body>(.*?)</body>.*$#s', '$1', $dom->saveHTML($body));
You should check out the Documentation for PHP's DOM classes

Creating a table of contents in php

I am looking to create a very simple, very basic nested table of contents in php which gets all the h1-6 and indents things appropriately. This means that if I have something like:
<h1>content</h1>
<h2>more content</h2>
I should get:
content
more content.
I know it will be css that creates the indents, that's fine, but how do I create a table of contents with working links to the content on the page?
Apparently it's hard to grasp what I am asking for...
I am asking for a function that reads an HTML document, pulls out all the h1-6 headings and makes a table of contents.
I used this package; it's pretty easy and straightforward to use.
https://github.com/caseyamcl/toc
Install via Composer by including the following in your composer.json file:
{
    "require": {
        "caseyamcl/toc": "^3.0"
    }
}
Or, drop the src folder into your application and use a PSR-4 autoloader to include the files.
Usage
This package contains two main classes:
TOC\MarkupFixer: Adds id anchor attributes to any H1...H6 tags that do not already have any (you can specify which header tag levels to use at runtime)
TOC\TocGenerator: Generates a Table of Contents from HTML markup
Basic Example:
$myHtmlContent = <<<END
<h1>This is a header tag with no anchor id</h1>
<p>Lorum ipsum doler sit amet</p>
<h2 id='foo'>This is a header tag with an anchor id</h2>
<p>Stuff here</p>
<h3 id='bar'>This is a header tag with an anchor id</h3>
END;
$markupFixer = new TOC\MarkupFixer();
$tocGenerator = new TOC\TocGenerator();
// This ensures that all header tags have `id` attributes so they can be used as anchor links
$htmlOut = "<div class='content'>" . $markupFixer->fix($myHtmlContent) . "</div>";
//This generates the Table of Contents in HTML
$htmlOut .= "<div class='toc'>" . $tocGenerator->getHtmlMenu($myHtmlContent) . "</div>";
echo $htmlOut;
This produces the following output:
<div class='content'>
<h1 id="this-is-a-header-tag-with-no-anchor-id">This is a header tag with no anchor id</h1>
<p>Lorum ipsum doler sit amet</p>
<h2 id="foo">This is a header tag with an anchor id</h2>
<p>Stuff here</p>
<h3 id="bar">This is a header tag with an anchor id</h3>
</div>
<div class='toc'>
<ul>
<li class="first last">
<span></span>
<ul class="menu_level_1">
<li class="first last">
<a href="#foo">This is a header tag with an anchor id</a>
<ul class="menu_level_2">
<li class="first last">
<a href="#bar">This is a header tag with an anchor id</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
For this you just have to search for the heading tags in the HTML code.
I wrote two functions (PHP 5.4.x).
The first one returns an array that contains the data for the table of contents. The data is only the headline itself, the id of the tag (if you want to use anchors), and a sub-table of contents.
function get_headlines($html, $depth = 1)
{
if($depth > 7)
return [];
$headlines = explode('<h' . $depth, $html);
unset($headlines[0]); // contains only text before the first headline
if(count($headlines) == 0)
return [];
$toc = []; // will contain the (sub-) toc
foreach($headlines as $headline)
{
list($hl_info, $temp) = explode('>', $headline, 2);
// $hl_info contains attributes of <hi ... > like the id.
list($hl_text, $sub_content) = explode('</h' . $depth . '>', $temp, 2);
// $hl_text contains the headline
// $sub_content may contain other <hi>-tags
$id = '';
if(strlen($hl_info) > 0 && ($id_tag_pos = stripos($hl_info,'id')) !== false)
{
// skip the opening quote so the search for the closing quote starts after it
// (assumes the id value is wrapped in double quotes)
$id_start_pos = stripos($hl_info, '"', $id_tag_pos) + 1;
$id_end_pos = stripos($hl_info, '"', $id_start_pos);
$id = substr($hl_info, $id_start_pos, $id_end_pos-$id_start_pos);
}
$toc[] = [ 'id' => $id,
'text' => $hl_text,
'sub_toc' => get_headlines($sub_content, $depth + 1)
];
}
return $toc;
}
The second returns a string that formats the toc with HTML.
function print_toc($toc, $link_to_htmlpage = '', $depth = 1)
{
if(count($toc) == 0)
return '';
$toc_str = '';
if($depth == 1)
$toc_str .= '<h1>Table of Content</h1>';
foreach($toc as $headline)
{
$toc_str .= '<p class="headline' . $depth . '">';
if($headline['id'] != '')
$toc_str .= '<a href="' . $link_to_htmlpage . '#' . $headline['id'] . '">';
$toc_str .= $headline['text'];
$toc_str .= ($headline['id'] != '') ? '</a>' : '';
$toc_str .= '</p>';
$toc_str .= print_toc($headline['sub_toc'], $link_to_htmlpage, $depth+1);
}
return $toc_str;
}
Both functions are far from perfect, but they work fine in my tests. Feel free to improve them.
Note: get_headlines is not a parser, so it does not work on broken HTML code and just crashes. It also only works with lowercase <hi>-tags.
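If broken markup or uppercase tags are a concern, the headings could instead be collected with PHP's DOM extension. This is only a sketch under that assumption; collectHeadings is a hypothetical name and it returns a flat list rather than the nested array used above:

```php
<?php
// Hypothetical helper: returns every h1..h6 in document order as
// ['level' => int, 'id' => string, 'text' => string].
function collectHeadings(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // do not emit warnings on sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $headings = [];
    // The HTML parser lowercases tag names, so <H2> is found as h2 as well.
    foreach ($xpath->query('//h1|//h2|//h3|//h4|//h5|//h6') as $node) {
        $headings[] = [
            'level' => (int) substr($node->nodeName, 1),
            'id'    => $node->getAttribute('id'),   // empty string when there is no id
            'text'  => trim($node->textContent),
        ];
    }
    return $headings;
}
```

The 'level' value can then drive the indentation, for example as a CSS class per entry, in the same way print_toc uses $depth.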
How about this (although it can only do one H level) ...
function getTOC(string $html, int $level=1) {
$toc="";
$x=0;
$n=0;
$html1="";
$safety=1000;
while ( $x>-1 and $safety-->0 ) {
$html0=strtolower($html);
$x=strpos($html0, "<h$level");
if ( $x>-1 ) {
$y=strpos($html0, "</h$level>");
$part=strip_tags(substr($html, $x, $y-$x));
$toc .="<a href='#head$n'>$part</a>\n";
$html1.=substr($html,0,$x)."<a name='head$n'></a>".substr($html, $x, $y-$x+5)."\n";
$html=substr($html, $y+5);
$n++;
}
}
$html1.=$html;
$html=$toc."\n<HR>\n".$html1;
return $html;
}
This will create a basic list of links
$html="<html><body>";
$html.="<h1>Heading 1a</h1>One Two Three";
$html.="<h2>heading 2a</h2>Four Five Six";
$html.="<h1 class='something'>Heading 1b</h1>Seven Eight Nine";
$html.="<h2>heading 2b</h2>Ten Eleven Twelve";
$html.="</body></html>";
echo getTOC($html, 1);
gives...
<a href='#head0'>Heading 1a</a>
<a href='#head1'>Heading 1b</a>
<HR>
<html><body><a name='head0'></a><h1>Heading 1a</h1>
One Two Three<h2>heading 2a</h2>Four Five Six<a name='head1'></a><h1
class='something'>Heading 1b</h1>
Seven Eight Nine<h2>heading 2b</h2>Ten Eleven Twelve</body></html>
See https://onlinephp.io/c/fceb0 for a running example
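This function returns the string with a table of contents appended, for h2 tags only. 100% tested code.
function toc($str){
// Put the TOC container just after the first image in the content.
// Assumption: the container is a <section> element, because the function looks up 'section' at the end.
$html = preg_replace('/<img[^>]+\>/i', '$0 <section>In This Article</section>', $str, 1);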
$doc = new DOMDocument();
$doc->loadHTML($html);
// create document fragment
$frag = $doc->createDocumentFragment();
// create initial list
$frag->appendChild($doc->createElement('ul'));
$head = &$frag->firstChild;
$xpath = new DOMXPath($doc);
$last = 1;
// get all H2 elements
$tagChek = array();
foreach ($xpath->query('//*[self::h2]') as $headline) {
// get level of current headline
sscanf($headline->tagName, 'h%u', $curr);
array_push($tagChek,$headline->tagName);
// move head reference if necessary
if ($curr < $last) {
// move upwards
for ($i=$curr; $i<$last; $i++) {
$head = &$head->parentNode->parentNode;
}
} elseif ($curr > $last && $head->lastChild) {
// move downwards and create new lists
for ($i=$last; $i<$curr; $i++) {
$head->lastChild->appendChild($doc->createElement('ul'));
$head = &$head->lastChild->lastChild;
}
}
$last = $curr;
// add list item
$li = $doc->createElement('li');
$head->appendChild($li);
$a = $doc->createElement('a', $headline->textContent);
$head->lastChild->appendChild($a);
// build ID
$levels = array();
$tmp = &$head;
// walk subtree up to fragment root node of this subtree
while (!is_null($tmp) && $tmp != $frag) {
$levels[] = $tmp->childNodes->length;
$tmp = &$tmp->parentNode->parentNode;
}
$id = 'sect'.implode('.', array_reverse($levels));
// set destination
$a->setAttribute('href', '#'.$id);
// add anchor to headline
$a = $doc->createElement('a');
$a->setAttribute('name', $id);
$a->setAttribute('id', $id);
$headline->insertBefore($a, $headline->firstChild);
}
// echo $frag;
// append fragment to document
if(!empty($tagChek)):
$doc->getElementsByTagName('section')->item(0)->appendChild($frag);
return $doc->saveHTML();
else:
return $str;
endif;
}

PHP Regex find text between custom added HTML Tags

I have the following scenario:
I've got an HTML template file that will be used for mailing.
Here is a reduced example:
<table>
<tr>
<td>Heading 1</td>
<td>heading 2</td>
</tr>
<PRODUCT_LIST>
<tr>
<td>Value 1</td>
<td>Value 2</td>
</tr>
</PRODUCT_LIST>
</table>
All I need to do is to get the HTML code inside <PRODUCT_LIST> and then repeat that code as many times as products I have on an array.
What would be the right PHP Regex code for getting/replacing this List?
Thanks!
Assuming <PRODUCT_LIST> tags will never be nested
preg_match_all('/<PRODUCT_LIST>(.*?)<\/PRODUCT_LIST>/s', $html, $matches);
//HTML array in $matches[1]
print_r($matches[1]);
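To go from the match to the repeated rows, something along these lines might work. It is a sketch, not a drop-in: renderProductList is a made-up name, and the {{name}} / {{price}} placeholders are an assumption (the template in the question has literal "Value 1" / "Value 2" cells instead).

```php
<?php
// Hypothetical helper: replaces each <PRODUCT_LIST>...</PRODUCT_LIST> block
// with one copy of its inner HTML per product.
function renderProductList(string $template, array $products): string
{
    return preg_replace_callback(
        '/<PRODUCT_LIST>(.*?)<\/PRODUCT_LIST>/s',
        function (array $m) use ($products): string {
            $rows = '';
            foreach ($products as $product) {
                // Fill the assumed placeholders, escaping the values for HTML.
                $rows .= strtr($m[1], [
                    '{{name}}'  => htmlspecialchars((string) $product['name']),
                    '{{price}}' => htmlspecialchars((string) $product['price']),
                ]);
            }
            return $rows;
        },
        $template
    );
}
```

The custom tags themselves are dropped from the output, so the mail only contains the generated rows.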
Use Simple HTML DOM Parser. It's easy to understand and use.
$html = str_get_html($content);
$el = $html->find('PRODUCT_LIST', 0);
$innertext = $el->innertext;
Use this function. It will return all found values as an array.
<?php
function get_all_string_between($string, $start, $end)
{
$result = array();
$string = " ".$string;
$offset = 0;
while(true)
{
$ini = strpos($string,$start,$offset);
if ($ini == 0)
break;
$ini += strlen($start);
$len = strpos($string,$end,$ini) - $ini;
$result[] = substr($string,$ini,$len);
$offset = $ini+$len;
}
return $result;
}
$result = get_all_string_between($input_string, '<PRODUCT_LIST>', '</PRODUCT_LIST>');
The approaches above work, but their performance is really horrible.
If you can use PHP 5, you can use a DOM object like this:
<?php
function getTextBetweenTags($tag, $html, $strict=0)
{
/*** a new dom object ***/
$dom = new domDocument;
/*** load the html into the object ***/
if($strict==1)
{
$dom->loadXML($html);
}
else
{
$dom->loadHTML($html);
}
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
/*** the tag by its tag name ***/
$content = $dom->getElementsByTagname($tag);
/*** the array to return ***/
$out = array();
foreach ($content as $item)
{
/*** add node value to the out array ***/
$out[] = $item->nodeValue;
}
/*** return the results ***/
return $out;
}
?>
and after adding this function you can just use it like this:
$content = getTextBetweenTags('PRODUCT_LIST', $your_html);
foreach( $content as $item )
{
echo $item.'<br />';
}
Yep, I just learned this today: don't use preg for HTML with PHP 5.
Try this regular expression with the preg_match_all function:
<PRODUCT_LIST>(.*?)<\/PRODUCT_LIST>
