Split long html into smaller pieces - php

I need to split long html into smaller pieces with respect to tags and inline styles.
E.g. given html
<table>
<tr>
<td style="font-size:12px;">Some long string here</td>
<td style="color: red">Some short string here</td>
<td style="font-weight: bold">Some specific string here</td>
</tr>
</table>
The source of the problem: I have a long HTML document of over 50k characters, and I need to translate it via the Google Translate API, which has a maximum of 5000 characters per request.

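To simply split it, you can use str_split and iterate over the array. Note that str_split cuts at fixed byte offsets, so a chunk can end in the middle of a tag or a multi-byte character: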
$chunks = str_split($html, 5000);
foreach ($chunks as $chunk) {
// $chunk var is up to 5000 chars long
}
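But first, you might want to consider sending only the text, without the HTML markup:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$text = "";
// Collect text nodes only: the textContent of an element already includes
// the text of its descendants, so looping over every element repeats nested text.
foreach ($xpath->query('//text()') as $node) {
$text .= $node->nodeValue . "\n";
}
Then apply the split operation to the $text variable.
For both combined, let's say your input variable is $html:
function htmlToText($html)
{
$text = "";
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Text nodes only, so nested text is not repeated once per ancestor element
foreach ($xpath->query('//text()') as $node) {
$text .= $node->nodeValue . "\n";
}
return $text;
}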
$chunks = str_split(htmlToText($html), 5000);
foreach ($chunks as $chunk) {
// dispatch the API operation over $chunk
}
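If the markup has to survive translation, one option is to cut at element boundaries instead of fixed offsets. The following is a minimal sketch, not a complete solution: chunkHtmlByNodes is a hypothetical helper name, it only groups the direct children of <body>, and a single child longer than the limit (such as one large table) is still emitted as one oversized chunk that would need further splitting, for example per row.

```php
<?php
// Hypothetical helper: group the serialized direct children of <body>
// into chunks whose length stays at or below $limit where possible.
function chunkHtmlByNodes(string $html, int $limit = 5000): array
{
    $dom = new DOMDocument();
    $state = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($state);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return [];
    }

    $chunks = [];
    $current = '';
    foreach ($body->childNodes as $node) {
        $piece = $dom->saveHTML($node); // markup of this child, tags and attributes intact
        // strlen counts bytes, which is a conservative estimate of characters
        if ($current !== '' && strlen($current . $piece) > $limit) {
            $chunks[] = $current;
            $current = '';
        }
        $current .= $piece;
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}
```

Each chunk is then well-formed on its own, so inline styles and tags are never cut in half.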

Related

Exclude string from preg_match_all if string contains certain value

I'm using preg_match_all to grab all scripts and place them before the closing body tag, like below:
preg_match_all('#(<script.*?</script>)#is', $html, $matches);
$js = '';
foreach ($matches[0] as $value):
$js .= $value;
endforeach;
$html = preg_replace('#(<script.*?</script>)#is', '', $html);
$html = preg_replace('#</body>#',$js.'</body>',$html);
However, this has broken some functionality on the page for a few scripts like the one below:
<script data-template="bundle-summary" type="text/x-magento-template">
<li>
<strong class="label"><%- data._label_ %>:</strong>
<div data-container="options"></div>
</li>
</script>
How can I use preg_match_all so that <script data-template ...> scripts are excluded from being moved?
I figured I could check whether a script is an x-magento-template script by doing something like below:
if (strpos($value, 'type="text/x-magento-template"') === false) {
$js .= $value;
}
Then it won't be added to the $js variable. However, I am unsure how to stop the same scripts from being deleted by the line below:
$html = preg_replace('#(<script.*?</script>)#is', '', $html);
I need to remove all scripts, except those that contain type="text/x-magento-template".
Update
I did the below, but I am wondering if there is a more efficient way of doing this with preg_match_all:
preg_match_all('#(<script.*?</script>)#is', $html, $matches);
$js = '';
foreach ($matches[0] as $value):
if (strpos($value, 'type="text/x-magento-template"') === false) {
$js .= $value;
$html = str_replace($value, '', $html);
}
endforeach;
//$html = preg_replace('#(<script.*?</script>)#is', '', $html);
$html = preg_replace('#</body>#',$js.'</body>',$html);
After timing the method with the if statement against the one without it, the difference was negligible (about 0.005 seconds each), so I am happy to leave it.
For html editing, a DOM approach gives better results:
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html); // or $dom->loadHTMLFile('./file.html');
$removeList=[];
$bodyNode = $dom->getElementsByTagName('body')->item(0);
foreach ($dom->getElementsByTagName('script') as $scriptNode) {
if ( $scriptNode->hasAttribute('type') && $scriptNode->getAttribute('type')=='text/x-magento-template' )
continue;
$removeList[] = $scriptNode;
}
foreach ($removeList as $scriptNode) {
$bodyNode->appendChild($scriptNode);
}
libxml_use_internal_errors($state);
echo $dom->saveHTML();
With this code you don't have to delete the script nodes: appending a node that is already in the DOM tree moves it from its current position to the end of the body element.
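If staying with regular expressions is preferred, the matching and the removal can happen in one pass with preg_replace_callback. This is a sketch under assumptions: moveScriptsToBodyEnd is a hypothetical function name, and templates are recognised by the same literal type="text/x-magento-template" check used in the question. It also inserts the collected scripts with substr instead of preg_replace, because preg_replace interprets sequences such as $1 or \1 inside the replacement string, which can corrupt JavaScript code.

```php
<?php
// Hypothetical helper: move every <script> block to just before </body>,
// leaving x-magento-template blocks where they are.
function moveScriptsToBodyEnd(string $html): string
{
    $js = '';
    $html = preg_replace_callback(
        '#<script\b.*?</script>#is',
        function (array $m) use (&$js) {
            if (strpos($m[0], 'type="text/x-magento-template"') !== false) {
                return $m[0]; // template: keep it in place
            }
            $js .= $m[0];     // real script: collect it ...
            return '';        // ... and remove it from its old position
        },
        $html
    );

    // Insert before the last </body>; plain string functions leave "$" and "\" untouched.
    $pos = strripos($html, '</body>');
    if ($pos === false) {
        return $html . $js;
    }
    return substr($html, 0, $pos) . $js . substr($html, $pos);
}
```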

php preg_match table and wrapping div

I have CMS-driven content, and as part of preparing the content on save, I want to clean up the tables the authors create.
We use Bootstrap on the front end, so I want to first grab all tables,
then check each table's parent element: if it is not <div class="table-responsive">, wrap the table in one.
I have:
// $content = $_POST['content'];
// Set some TEST content
$content = "<h1>My Content</h1>
<p>This is some content</p>
<table border=\"1\">
<tr>
<td>cell</td>
<td>cell</td>
<td>cell</td>
</tr>
<tr>
<td>cell</td>
<td>cell</td>
<td>cell</td>
</tr>
</table>
<div align=\"center\">see the above content</div>
<p>Thanks!</p>\n\n";
// Make our example content longer with more variations...
$content = $content .
str_replace('<table border="1">', '<table border="0" class="my-table">', $content) .
str_replace('<table border="1">', '<table border="0" cellpadding="0" cellspacing="3">', $content);
$output = $content;
// Parse for table tags
preg_match_all("/<table(.*?)>/", $content, $tables);
// If we have table tags..
if(count($tables[1]) > 0) {
// loop over the matches and gather the info we need to build the new table tag.
foreach($tables[0] as $key => $match) {
$add_class = array();
$tag = ' '. $tables[1][$key] .' ';
$add_class[] = 'table';
// check if we have got borders...
// If we do, add the Bootstrap table-bordered class.
if(strpos($tag, 'border="0"') === FALSE) {
$add_class[] = 'table-bordered';
}
// prepend any existing/custom classes.
if(strpos($tag, 'class="') > 0) {
preg_match("/class=\"(.*?)\"/", $tag, $classes);
if($classes[1]) {
$add_class = array_merge($add_class, explode(' ', $classes[1]));
}
}
// add classes.
$add_class = array_unique($add_class);
// Now - replace the original <table> tag with the new BS tag.
// adding any class attrs
// wrap in the responsive DIV. - THIS part - needs to be only added if its not already wrapped...
// this would happen if we have already edited the page before right ...
$output = str_replace($match, '<div class="table-responsive">'."\n".'<table class="'. implode(' ', $add_class) .'">', $output);
}
// replace all closing </table> tags with the closing responsive tag too...
$output = str_replace('</table>', '</table>'."\n".'</div>', $output);
}
echo highlight_string($content, TRUE);
echo '<hr>';
echo highlight_string($output, TRUE);
You can use Simple HTML DOM Parser to select the divs:
https://github.com/sunra/php-simple-html-dom-parser
$html = new simple_html_dom();
$html->load_file($filepath); // $filepath: placeholder for the path or URL of the page
# get the element with the id "youdiv"
$element = $html->find("#youdiv", 0);
Good luck
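The "only wrap when not already wrapped" check from the question is easier to express on a DOM tree than with string replacement. Below is a minimal sketch using PHP's built-in DOMDocument; wrapTablesResponsive is a hypothetical name, the outer <div> added before parsing is only a temporary container so the fragment can be extracted again, and the Bootstrap class handling from the question is left out.

```php
<?php
// Hypothetical helper: wrap each <table> in <div class="table-responsive">
// unless its parent already is such a div.
function wrapTablesResponsive(string $content): string
{
    $dom = new DOMDocument();
    $state = libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $content . '</div>'); // temporary container
    libxml_clear_errors();
    libxml_use_internal_errors($state);

    // Copy the live node list to an array before changing the tree.
    $tables = iterator_to_array($dom->getElementsByTagName('table'));
    foreach ($tables as $table) {
        $parent = $table->parentNode;
        $wrapped = $parent instanceof DOMElement
            && $parent->nodeName === 'div'
            && in_array(
                'table-responsive',
                preg_split('/\s+/', trim($parent->getAttribute('class'))),
                true
            );
        if ($wrapped) {
            continue; // already wrapped, e.g. the page was saved before
        }
        $div = $dom->createElement('div');
        $div->setAttribute('class', 'table-responsive');
        $parent->replaceChild($div, $table); // put the wrapper where the table was
        $div->appendChild($table);           // and move the table inside it
    }

    // Serialize the children of the temporary container (first div in document order).
    $root = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Running the function twice on the same content should not add a second wrapper, which is the case the string-replacement version cannot detect.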

How to get "innerContent" with DOMdocument? [duplicate]

<blockquote>
<p>
2 1/2 cups sweet cherries, pitted<br>
1 tablespoon cornstarch <br>
1/4 cup fine-grain natural cane sugar
</p>
</blockquote>
Hi, I want to get the text inside the 'p' tag. As you can see, there are three different lines, and I want to print them separately after adding some extra text to each line. Here is my code block:
$tags = $dom->getElementsByTagName('blockquote');
foreach($tags as $tag)
{
$datas = $tag->getElementsByTagName('p');
foreach($datas as $data)
{
$line = $data->nodeValue;
echo $line;
}
}
The main problem is that $line is the full text inside the 'p' tag as a single string. How can I separate the three lines and handle them individually?
Thanks in advance.
You can do that with XPath. All you have to do is query the text nodes. No need to explode or something like that:
$dom = new DOMDocument;
$dom->loadHtml($html);
$xp = new DOMXPath($dom);
foreach ($xp->query('/html/body/blockquote/p/text()') as $textNode) {
echo "\n<li>", trim($textNode->textContent);
}
The non-XPath alternative would be to iterate the children of the P tag and only output them when they are DOMText nodes:
$dom = new DOMDocument;
$dom->loadHtml($html);
foreach ($dom->getElementsByTagName('p')->item(0)->childNodes as $pChild) {
if ($pChild->nodeType === XML_TEXT_NODE) {
echo "\n<li>", trim($pChild->textContent);
}
}
Both will output (demo)
<li>2 1/2 cups sweet cherries, pitted
<li>1 tablespoon cornstarch
<li>1/4 cup fine-grain natural cane sugar
Also see DOMDocument in php for an explanation of the node concept. It's crucial to understand when working with DOM.
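Since the question asks for extra text added to each line, it may help to collect the text nodes into an array first. A small sketch, assuming the blockquote/p structure from the question; paragraphLines is a hypothetical helper name, and the "Ingredient: " prefix in the usage lines is only an illustration:

```php
<?php
// Hypothetical helper: return the non-empty text lines found directly
// inside <blockquote><p>, one array entry per text node.
function paragraphLines(string $html): array
{
    $dom = new DOMDocument();
    $state = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($state);

    $xp = new DOMXPath($dom);
    $lines = [];
    foreach ($xp->query('//blockquote/p/text()') as $textNode) {
        $line = trim($textNode->textContent);
        if ($line !== '') {
            $lines[] = $line;
        }
    }
    return $lines;
}

$html = "<blockquote><p>\n2 1/2 cups sweet cherries, pitted<br>\n"
      . "1 tablespoon cornstarch <br>\n1/4 cup fine-grain natural cane sugar\n</p></blockquote>";
foreach (paragraphLines($html) as $line) {
    echo 'Ingredient: ', $line, "\n";
}
```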
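You can use explode. Note that nodeValue contains only the text, with the <br> tags already stripped, so serialize the node with saveHTML() first (the first and last entries still carry the <p> tags; strip_tags() removes them):
$lines = explode('<br>', $dom->saveHTML($data));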
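Here is the same idea: JavaScript would use split(), and the PHP equivalent is explode(). It assumes $line still contains the <br> markup, for example from saveHTML():
$tempArray = explode('<br>', $line);
echo $tempArray[0];
echo $tempArray[1];
echo $tempArray[2];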
You can use the PHP explode function like this (assuming the lines in your <p> tag are separated by <br>):
$tags = $dom->getElementsByTagName('blockquote');
foreach($tags as $tag)
{
$datas = $tag->getElementsByTagName('p');
foreach($datas as $data)
{
// nodeValue drops the <br> tags, so serialize the node to keep them
$contents = $dom->saveHTML($data);
$lines = explode('<br>',$contents);
foreach($lines as $line) {
// strip the surrounding <p> markup and whitespace
echo trim(strip_tags($line));
}
}
}

Preserving <br> tags when parsing HTML text content

I have a small issue.
I want to parse a simple HTML document in PHP.
Here is the HTML:
<html>
<body>
<table>
<tr>
<td>Colombo <br> Coucou</td>
<td>30</td>
<td>Sunny</td>
</tr>
<tr>
<td>Hambantota</td>
<td>33</td>
<td>Sunny</td>
</tr>
</table>
</body>
</html>
And this is my PHP code :
$dom = new DOMDocument();
$html = $dom->loadHTMLFile("test.html");
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cols = $row->getElementsByTagName('td');
echo $cols->item(0)->nodeValue.'<br />';
echo $cols->item(1)->nodeValue.'<br />';
echo $cols->item(2)->nodeValue;
}
But as you can see, I have a <br> tag and I need it, but when my PHP code runs, the tag is removed.
Can anybody explain how I can keep it?
I would recommend capturing the values of the table cells with the help of XPath:
$values = array();
$xpath = new DOMXPath($dom);
foreach($xpath->query('//tr') as $row) {
$row_values = array();
foreach($xpath->query('td', $row) as $cell) {
$row_values[] = innerHTML($cell);
}
$values[] = $row_values;
}
Also, I've had the same problem as you with <br> tags being stripped out of fetched content: they are element nodes with no text of their own, so nodeValue drops them, and they are not replaced with a newline character (\n) either.
So I wrote my own innerHTML function, which has proved invaluable in many projects. Here it is:
function innerHTML(DOMElement $element, $trim = true, $decode = true) {
$innerHTML = '';
foreach ($element->childNodes as $node) {
$temp_container = new DOMDocument();
$temp_container->appendChild($temp_container->importNode($node, true));
$innerHTML .= ($trim ? trim($temp_container->saveHTML()) : $temp_container->saveHTML());
}
return ($decode ? html_entity_decode($innerHTML) : $innerHTML);
}
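On PHP 5.3.6 and later, DOMDocument::saveHTML() accepts a node argument, so the temporary document is not strictly needed. A shorter variant as a sketch; innerHtmlOf is a hypothetical name chosen so it does not clash with the innerHTML function above, and it skips the trim and entity-decoding options:

```php
<?php
// Hypothetical helper: serialize the children of a node, keeping tags such as <br>.
function innerHtmlOf(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) dumps only that node instead of the whole document
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('<table><tr><td>Colombo <br> Coucou</td><td>30</td></tr></table>');
$cell = $dom->getElementsByTagName('td')->item(0);
echo innerHtmlOf($cell); // the <br> is kept in the output
```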

PHP Regex find text between custom added HTML Tags

I have the following scenario:
I've got an HTML template file that will be used for mailing.
Here is a reduced example:
<table>
<tr>
<td>Heading 1</td>
<td>heading 2</td>
</tr>
<PRODUCT_LIST>
<tr>
<td>Value 1</td>
<td>Value 2</td>
</tr>
</PRODUCT_LIST>
</table>
All I need to do is get the HTML code inside <PRODUCT_LIST> and then repeat that code once for each product in an array.
What would be the right PHP regex for getting/replacing this list?
Thanks!
Assuming <PRODUCT_LIST> tags will never be nested
preg_match_all('/<PRODUCT_LIST>(.*?)<\/PRODUCT_LIST>/s', $html, $matches);
//HTML array in $matches[1]
print_r($matches[1]);
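To repeat the captured block once per product, the same pattern can be used with preg_replace_callback. This is only a sketch: renderProductList is a hypothetical function name, and the {name}-style placeholders are an assumption, since the template in the question contains fixed sample values rather than placeholders.

```php
<?php
// Hypothetical helper: replace each <PRODUCT_LIST>...</PRODUCT_LIST> block
// with one copy of its inner HTML per product.
function renderProductList(string $template, array $products): string
{
    return preg_replace_callback(
        '/<PRODUCT_LIST>(.*?)<\/PRODUCT_LIST>/s',
        function (array $m) use ($products) {
            $rows = '';
            foreach ($products as $product) {
                // Build "{key}" => escaped value pairs (placeholder syntax is assumed)
                $pairs = [];
                foreach ($product as $key => $value) {
                    $pairs['{' . $key . '}'] = htmlspecialchars((string) $value);
                }
                $rows .= strtr($m[1], $pairs);
            }
            return $rows;
        },
        $template
    );
}

$template = '<table><PRODUCT_LIST><tr><td>{name}</td><td>{price}</td></tr></PRODUCT_LIST></table>';
$products = [
    ['name' => 'Pen',  'price' => '1.20'],
    ['name' => 'Book', 'price' => '9.00'],
];
echo renderProductList($template, $products);
```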
Use Simple HTML DOM Parser. It's easy to understand and use.
$html = str_get_html($content);
$el = $html->find('PRODUCT_LIST', 0);
$innertext = $el->innertext;
Use this function. It will return all found values as an array.
<?php
function get_all_string_between($string, $start, $end)
{
$result = array();
$string = " ".$string;
$offset = 0;
while(true)
{
$ini = strpos($string,$start,$offset);
if ($ini == 0)
break;
$ini += strlen($start);
$len = strpos($string,$end,$ini) - $ini;
$result[] = substr($string,$ini,$len);
$offset = $ini+$len;
}
return $result;
}
$result = get_all_string_between($input_string, '<PRODUCT_LIST>', '</PRODUCT_LIST>');
The above works, but its performance is really horrible.
If you can use PHP 5, you can use a DOM object like this:
<?php
function getTextBetweenTags($tag, $html, $strict=0)
{
/*** a new dom object ***/
$dom = new domDocument;
/*** load the html into the object ***/
if($strict==1)
{
$dom->loadXML($html);
}
else
{
$dom->loadHTML($html);
}
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
/*** the tag by its tag name ***/
$content = $dom->getElementsByTagname($tag);
/*** the array to return ***/
$out = array();
foreach ($content as $item)
{
/*** add node value (text only, inner markup is stripped) to the out array ***/
$out[] = $item->nodeValue;
}
/*** return the results ***/
return $out;
}
?>
After adding this function, you can use it like this:
$content = getTextBetweenTags('PRODUCT_LIST', $your_html);
foreach( $content as $item )
{
echo $item.'<br />';
}
?>
Yep, I just learned this today: don't use preg for HTML with PHP 5.
Try this regular expression in preg_match_all:
<PRODUCT_LIST>(.*?)<\/PRODUCT_LIST>
