PHP Simple HTML DOM parser

I am working on a simple web crawler. Below is the simple HTML I am using to learn.
input.php
<ul id="nav">
<li>
Google
<ul>
<li>
Gmail
</li>
</ul>
</li>
<li>
Yahoo
<ul>
<li>
Yahoo Mail
</li>
</ul>
</li>
</ul>
I need to crawl the first anchor tag in each ul[id=nav]->li. The code I use to crawl input.php is:
<?php
include 'simple_html_dom.php';
$html = file_get_html('input.php');
foreach ($html->find('ul[id=nav]') as $navUL){
foreach ($navUL->find('li') as $navUL_LI){
echo $navUL_LI->find('a',0)->outertext."<br>";
}
}
?>
It displays all the anchor tags in input.php. I need to display only Google and Yahoo. How can I achieve this?

In this case you can target them directly with the children() method. Example:
foreach($html->find('ul#nav') as $ul) {
foreach($ul->children() as $li) {
echo $li->children(0)->outertext . '<br/>';
}
}
Alternatively, you can use DOMDocument + DOMXpath for this too:
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
// directly target those links
$links = $xpath->query('//ul[@id="nav"]/li/a');
foreach($links as $a) {
echo $a->nodeValue . '<br/>';
}
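As a self-contained sketch of the XPath approach: the markup below assumes each list item wraps its label in an `<a>` tag (the href values are placeholders, not taken from the question), and the helper name `topLevelLinkTexts` is made up for illustration.

```php
<?php
// Sample markup modelled on the question's input.php; the href values are placeholders.
$str = <<<HTML
<ul id="nav">
<li><a href="#google">Google</a>
<ul><li><a href="#gmail">Gmail</a></li></ul>
</li>
<li><a href="#yahoo">Yahoo</a>
<ul><li><a href="#ymail">Yahoo Mail</a></li></ul>
</li>
</ul>
HTML;

// Hypothetical helper: returns the text of the anchors that are direct
// children of the top-level <li> elements of <ul id="nav">.
function topLevelLinkTexts(string $markup): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($markup);
    $xpath = new DOMXPath($dom);
    $texts = [];
    // "/li/a" only follows direct children, so nested lists are skipped.
    foreach ($xpath->query('//ul[@id="nav"]/li/a') as $a) {
        $texts[] = trim($a->nodeValue);
    }
    return $texts;
}

print_r(topLevelLinkTexts($str));
```

Because each step in the path uses a single slash, the anchors inside the nested lists never match.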

<?php
include 'simple_html_dom.php';
$html = file_get_html('input.php');
foreach ($html->find('ul[id=nav]') as $navUL){
foreach ($navUL->find('li') as $navUL_LI){
if(stripos($navUL_LI,'google') !== false || stripos($navUL_LI,'yahoo') !== false){
echo $navUL_LI->find('a',0)->outertext."<br>";
}
}
}
?>

I have done the same work in Objective-C.
You can use the XML or HTML APIs to serialize your HTML object.
If you want to do this by hand, find the opening tag and the closing tag.
After that, get the first child, then the second, and so on.

Try this:
// get the children of the element #nav, i.e. the top level lis
$lis = $html->getElementById("nav")->childNodes();
// for each child, find the first 'a' element
foreach ($lis as $li) {
$a = $li->find('a',0);
// retrieve the link text itself.
echo "link text: " . $a->innertext() . "\n";
}
See the simple-html-dom manual for details of all these methods.

you can simply achieve that by:
<?php
foreach ($html->find('ul[id=nav]') as $navUL){
foreach ($navUL->find('li') as $navUL_LI){
echo $navUL_LI->find('a',-2)->outertext."<br>";
}
}
?>

<?php
$in = '<style> .catalog-product-view .product.attribute.overview ul { margin-top: 10px; } </style><img src="/media/wysiwyg/img/misc/made-in-the-usa-doh-blue4.png"><ul><li>Ships as (12) 40 fl oz bottles</li></ul>';
function parseTags($input, $callback) {
$len = strlen($input);
$stack = [];
$tag = "";
$data = "";
$isTag = false;
$isString = false;
for ($i=0; $i<$len; $i++) {
$char = $input[$i];
if ($char == '<') {
$isTag = true;
$tag .= $char;
} else if ($char == '>') {
$tag .= $char;
if (substr($tag, 0, 2) == '</') {
$close = str_replace('>', '', str_replace('</', '', explode(' ', $tag, 2)[0]));
$open = str_replace('>', '', str_replace('<', '', explode(' ', end($stack), 2)[0]));
if ($open == $close) {
$callback($tag, $data, $stack, $i, false);
array_pop($stack);
}
} else if (substr($tag, -2) == '/>') {
$callback($tag, $data, $stack, $i, false);
} else {
$callback($tag, $data, $stack, $i, true);
$stack[] = $tag;
}
$tag = "";
$data = "";
$isTag = false;
} else if ($char == '"' || $char == "'") {
if ($isString == false) {
$isString = $char;
} else if ($isString == $char && $input[$i-1] != '\\') {
$isString = false;
}
} else if ($isTag) {
$tag .= $char;
} else {
$data .= $char;
}
}
}
parseTags($in, function($tag, $data, $stack, $position, $isOpen) use (&$out) {
print_r(func_get_args());
});

Related

Why is the "Learn More" link not linking to the page?

I'm trying to understand how this code (from another developer) is written. It has a bug that I can't seem to fix: the "Learn more" link doesn't link to the post given in the custom field.
I've tried removing the "Learn more" lines, but then the slide link points to the image itself and not to what's in the custom link field.
$slides = ONS_Slide_Custom_Post_Type::find_all('DESC');
if (isset($slides) && count($slides) > 0) {
$items = array();
foreach ($slides as $slide) {
//echo '<tt><pre>' . var_export($slide, true) . '</pre></tt>';
$item = new stdClass();
if (isset($slide->custom_data) && count($slide->custom_data) > 0) {
if (isset($slide->custom_data['ons_slide_image'])) {
$item->src = $slide->custom_data['ons_slide_image'];
}
if (isset($slide->custom_data['ons_slide_heading'])) {
$item->heading = $slide->custom_data['ons_slide_heading'];
$item->heading .= '<span class="punctuation">.</span><span class="learn_more"> »</span>';
}
if (isset($slide->custom_data['ons_slide_caption'])) {
$item->caption = $slide->custom_data['ons_slide_caption'];
$item->caption .= ' Learn more »';
}
if (isset($slide->custom_data['ons_slide_href'])) {
$item->href = $slide->custom_data['ons_slide_href'];
} else {
$item->href = "#";
}
}
$items[] = $item;
}
$carousel = new ONS_Bootstrap_Carousel($items);
echo $carousel;
}

You are already reading $slide->custom_data['ons_slide_href'], but only AFTER the lines of code that build your anchor tag.
So try reordering the processing like this:
$slides = ONS_Slide_Custom_Post_Type::find_all('DESC');
if (isset($slides) && count($slides) > 0) {
$items = array();
foreach ($slides as $slide) {
//echo '<tt><pre>' . var_export($slide, true) . '</pre></tt>';
$item = new stdClass();
if (isset($slide->custom_data) && count($slide->custom_data) > 0) {
if (isset($slide->custom_data['ons_slide_image'])) {
$item->src = $slide->custom_data['ons_slide_image'];
}
if (isset($slide->custom_data['ons_slide_heading'])) {
$item->heading = $slide->custom_data['ons_slide_heading'];
$item->heading .= '<span class="punctuation">.</span><span class="learn_more"> »</span>';
}
// moved this code above the anchor tag line
if (isset($slide->custom_data['ons_slide_href'])) {
$item->href = $slide->custom_data['ons_slide_href'];
} else {
$item->href = "#";
}
// Now concatenate $item->href in the anchor tag line
if (isset($slide->custom_data['ons_slide_caption'])) {
$item->caption = $slide->custom_data['ons_slide_caption'];
$item->caption .= ' <a href="' . $item->href . '">Learn more »</a>';
}
}
$items[] = $item;
}
$carousel = new ONS_Bootstrap_Carousel($items);
echo $carousel;
}

PHP - echo the opening and closing of tags from array

Here I have an array of content. Each entry is either a plain string, which is output as text, or a nested array of one, two, or more items, depending on the tags required; in a nested array the first element is the text to output.
The other content in the nested array refers to the HTML tags the text belongs in, such as:
[1] => Array
(
[0] => bo <====== Text to output
[1] => bold <====== tag to be within
)
However, for the sake of simplicity, I would prefer the output not to repeat tags constantly, like:
This is a test <b>bo</b><i><b>ld</b></i><i>,</i> <u><i>u</i></u><u>nderline</u> ...
Instead the output should be:
This is a test<b>bo<i>ld</i></b><i>, <u>u</u></i><u>nderline</u> ...
This is the PHP code I have for it so far...
$use = array();
$base = "";
foreach ($build as $part => $data) {
// print_r($use);
if(!is_array($data)){
$base .= $data;
} else {
$text = array_shift($data);
if(!is_array($data[0])){
$data = array($data[0]);
} else {
$data = $data[0];
}
$removed = array_diff($use,$data);
foreach (($data) as $tag) {
if (in_array($tag, array_diff($use,$data))) {
$base .= "<\/" . $tag . ">";
} elseif(!in_array($tag, $use)){
$base .= "<" . $tag . ">";
array_push($use, $tag);
}
}
$use = $data;
$base .= $text;
}
}
print_r($base);
And here is the array if required (in JSON format!):
["This is a test\nIncluding ",["bo","bold"],["ld",["italic","bold"]],[", ","italic"],["u",["underline","italic"]],["nderlined","underline"],", ",["strike-through","strike"],", and ",["italic","italic"],"\ntext:\n\n",["numbered lists",["underline","strike","italic","bold"]],["\n",[]],"as well as",["\n",[]],["non ordered lists","http:\/\/test.com"],["\n",[]],"it works very well",["\n",[]],["try it","http:\/\/google.com"],"\n",["http:\/\/google.com",["bold","http:\/\/google.com"]],"\n\n",["wow","bold"],"\n",["lol","bold"]]
Any help would be much appreciated... thanks!

I'm honestly not sure if this is exactly what you're looking for. It would be great to have a full desired output... but I believe this is as close as it gets. It took me 3 hours so it better be it. It's a great question, very hard to accomplish.
I did print_r(htmlentities($base)), but you can simply do print_r($base) to see the formatted result. I did that because it was easier to check with the output you provided in the question.
Also, I modified your JSON because some tags specified there are non-existent. For example, I changed underline for u, italic for i, bold for b. Alternatives are em, strong... anyway, that's just a side-note.
<?php
$build = json_decode('["This is a test\nIncluding ",["bo","b"],["ld",["i","b"]],[", ","i"],["u",["u","i"]],["nderlined","u"],", ",["strike-through","strike"],", and ",["italic","i"],"\ntext:\n\n",["numbered lists",["u","strike","i","b"]],["\n",[]],"as well as",["\n",[]],["non ordered lists","http:\/\/test.com"],["\n",[]],"it works very well",["\n",[]],["try it","http:\/\/google.com"],"\n",["http:\/\/google.com",["b","http:\/\/google.com"]],"\n\n",["wow","b"],"\n",["lol","b"]]', true);
$used = [];
$base = '';
foreach($build as $data){
if(is_array($data)){
$text = array_shift($data);
$tags = $data[0];
if(!is_array($data[0])){
$tags = [$data[0]];
}
$elements = '';
$tagsToClose = array_diff($used, $tags);
$changes = true;
$i = 0;
foreach($tagsToClose as $tag){
while($changes){
$changes = false;
if($lastOpened != $tag){
$changes = true;
$elements .= '</'.$lastOpened.'>';
unset($used[$i++]);
$lastOpened = $used[$i];
}
}
$elements .= '</'.$tag.'>';
$key = array_search($tag, $used);
unset($used[$key]);
}
foreach($tags as $tag){
if(!in_array($tag, $used)){
$elements .= '<'.$tag.'>';
array_unshift($used, $tag);
$lastOpened = $tag;
}
}
$elements .= $text;
$data = $elements;
}
$base .= $data;
}
unset($used);
$base .= '</'.$lastOpened.'>';
print_r(htmlentities($base));
?>
EDIT
And here's the result I got, just in case you run into some trouble testing or to check with your results or whatever:
This is a test Including <b>bo<i>ld</i></b><i>, <u>u</u></i><u>nderlined, </u><strike>strike-through, and </strike><i>italic text: <u><strike><b>numbered lists</b></strike></u></i> as well as <http://test.com>non ordered lists</http://test.com> it works very well <http://google.com>try it <b>http://google.com </b></http://google.com><b>wow lol</b>

After many hours, this was my solution that I ended up with:
$build = json_decode('["This is a test Including\u00a0",["bo","bold"],["ld",["italic","bold"]],[",\u00a0","italic"],["u",["underline","italic"]],["nderlined,\u00a0","underline"],"strike-through, and\u00a0",["italic text:\u00a0","italic"],"it works very well\u00a0try it\u00a0",["http:\/\/google.com",["bold","http:\/\/google.com"]],["\u00a0wow lol","bold"]]',true);
$standard = array("bold"=>"b","underline"=>"u","strike"=>"s","italic"=>"i","link"=>"a","size"=>null);
$lists = array("ordered"=>"ol","bullet"=>"ul");
$size = array("huge"=>"2.5em","large"=>"1.5em");
$base = "";
foreach($build as $part){
$use = array();
$tags = true;
$len = 1;
if(!is_array($part) or count($part) == 1){
$text = $part;
$tags = false;
$part = array();
} else {
$text = array_shift($part);
if(count($part) == 1){
if(is_array($part[0])){
$part = $part[0];
}
}
if(!is_array($part)){
$part = array($part);
}
}
if($tags){
foreach ($part as $tag) {
if(!in_array($tag, array_keys($standard)) && !in_array($tag, array_keys($lists)) && !in_array($tag, array_keys($size))){
$base .= '<a href="' . $tag . '" title="' . $tag . '" class="link">';
$tag = "link";
} elseif(in_array($tag, array_keys($size))){
$base .= "<span style='font-size:" . $size[$tag] . "'>";
} elseif(!in_array($tag, array_keys($lists))) {
$base .= "<" . $standard[$tag] . ">";
}
array_push($use, $tag);
}
$base .= $text;
foreach (array_reverse($part) as $tag) {
if(!in_array($tag, array_keys($standard)) && !in_array($tag, array_keys($lists)) && !in_array($tag, array_keys($size))){
$base .= '</a>';
} elseif(in_array($tag, array_keys($size))){
$base .= "</span>";
} elseif (!in_array($tag, array_keys($lists))) {
$base .= "</" . $standard[$tag] . ">";
}
array_push($use, $tag);
}
} else {
$base .= $text;
}
}
print_r($base);
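For comparison, here is a stack-based sketch of the minimal-nesting output asked for in the question. It is not either poster's code: the function name `renderSegments` is made up, tag names are assumed to be real HTML tags already (b, i, u), and the text is assumed to be safe to output as-is.

```php
<?php
// Hypothetical helper: each segment is either a plain string or
// [text, tag] / [text, [tag, tag, ...]].
function renderSegments(array $segments): string
{
    $stack = []; // currently open tags, outermost first
    $html = '';
    foreach ($segments as $segment) {
        $text = is_array($segment) ? $segment[0] : $segment;
        $tags = is_array($segment) ? (array) ($segment[1] ?? []) : [];

        // Keep the outermost run of open tags that this segment still needs...
        $keep = 0;
        while ($keep < count($stack) && in_array($stack[$keep], $tags, true)) {
            $keep++;
        }
        // ...and close everything nested above it, innermost first.
        while (count($stack) > $keep) {
            $html .= '</' . array_pop($stack) . '>';
        }
        // Open whatever the segment needs that is not open already.
        foreach ($tags as $tag) {
            if (!in_array($tag, $stack, true)) {
                $html .= '<' . $tag . '>';
                $stack[] = $tag;
            }
        }
        $html .= $text;
    }
    // Close anything still open at the end.
    while ($stack) {
        $html .= '</' . array_pop($stack) . '>';
    }
    return $html;
}

$build = ['This is a test ', ['bo', 'b'], ['ld', ['i', 'b']], [', ', 'i'], ['u', ['u', 'i']], ['nderlined', 'u']];
echo renderSegments($build);
// prints: This is a test <b>bo<i>ld</i></b><i>, <u>u</u></i><u>nderlined</u>
```

A tag is only closed when a segment no longer needs it, and closing it forces everything opened after it to close too, which keeps the output well nested.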

php DOMDocument extract links with anchor or alt

I wish to extract all the links on a page together with their anchor text, or with the alt attribute of an image inside the link if the image comes first.
$html = '<a href="lien.fr">Anchor</a>';
Must return "lien.fr;Anchor"
$html = '<a href="lien.fr"><img alt="Alt Anchor">Anchor</a>';
Must return "lien.fr;Alt Anchor"
$html = '<a href="lien.fr">Anchor<img alt="Alt Anchor"></a>';
Must return "lien.fr;Anchor"
I did:
$doc = new DOMDocument();
$doc->loadHTML($html);
$out = [];
$n = 0;
$links = $doc->getElementsByTagName('a');
foreach ($links as $element) {
$href = $img_alt = $anchor = "";
$href = $element->getAttribute('href');
$n++;
if (strpos($href, "panier?") === false) {
if ($element->firstChild->nodeName == "img") {
$imgs = $element->getElementsByTagName('img');
foreach ($imgs as $img) {
if ($anchor = $img->getAttribute('alt')) {
break;
}
}
}
if (($anchor == "") && ($element->nodeValue)) {
$anchor = $element->nodeValue;
}
$out[$n]['link'] = $href;
$out[$n]['anchor'] = $anchor;
}
}
This seems to work, but it fails when there is whitespace or indentation inside the link, as in:
$html = '<a href="link.fr">
<img src="ceinture-gris" alt="alt anchor"/>
</a>';
Here $element->firstChild->nodeName will be #text.

Something like this:
$doc = new DOMDocument();
$doc->loadHTML($html);
// Output texts that will later be joined with ';'
$out = [];
// Maximum number of items to add to $out
$max_out_items = 2;
// List of img tag attributes that will be parsed by the loop below
// (in the order specified in this array!)
$img_attributes = ['alt', 'src', 'title'];
$links = $doc->getElementsByTagName('a');
foreach ($links as $element) {
if ($href = trim($element->getAttribute('href'))) {
$out []= $href;
if (count($out) >= $max_out_items)
break;
}
foreach ($element->childNodes as $child) {
if ($child->nodeType === XML_TEXT_NODE &&
$text = trim($child->nodeValue))
{
$out []= $text;
if (count($out) >= $max_out_items)
break;
} elseif ($child->nodeName == 'img') {
foreach ($img_attributes as $attr_name) {
if ($attr_value = trim($child->getAttribute($attr_name))) {
$out []= $attr_value;
if (count($out) >= $max_out_items)
goto Result;
}
}
}
}
}
Result:
echo $out = implode(';', $out);
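A shorter sketch of the same idea, focused only on the whitespace problem from the question: skip whitespace-only text nodes, then let the first meaningful child decide. The helper name `linkLabels` and the "href;label" array return format are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns "href;label" for every link in $html.
// The label is the alt text of an <img> if the image is the first
// non-whitespace child of the link, otherwise the link's text.
function linkLabels(string $html): array
{
    $doc = new DOMDocument();
    // Suppress parser warnings for imperfect real-world markup.
    @$doc->loadHTML($html);
    $out = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $label = '';
        foreach ($a->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) === '') {
                continue; // skip indentation and line breaks
            }
            if ($child->nodeName === 'img') {
                $label = trim($child->getAttribute('alt'));
            }
            break; // only the first meaningful child decides
        }
        if ($label === '') {
            $label = trim($a->textContent);
        }
        $out[] = $a->getAttribute('href') . ';' . $label;
    }
    return $out;
}

print_r(linkLabels("<a href=\"link.fr\">\n  <img src=\"ceinture-gris\" alt=\"alt anchor\"/>\n</a>"));
```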

Changed formation while use character limit in TBS library [duplicate]

I have various HTML strings to cut to 100 characters (of the stripped content, not the original) without stripping tags and without breaking HTML.
Original HTML string (288 characters):
$content = "<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the air
<span>everywhere</span>, it's a HTML taggy kind of day.</strong></div>";
Standard trim: Trim to 100 characters and HTML breaks, stripped content comes to ~40 characters:
$content = substr($content, 0, 100)."..."; /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove... */
Stripped HTML: Outputs the correct character count but obviously loses formatting:
$content = substr(strip_tags($content), 0, 100)."..."; /* output:
With a span over here and a nested div over there and a lot of other nested
texts and tags in the ai... */
Partial solution: using HTML Tidy or a purifier to close off tags outputs clean HTML, but the 100 characters count the HTML itself rather than the displayed content.
$content = substr($content, 0, 100)."...";
$tidy = new tidy; $tidy->parseString($content); $tidy->cleanRepair(); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div ove</div></div>... */
Challenge: To output clean HTML and n characters (excluding character count of HTML elements):
$content = cutHTML($content, 100); /* output:
<div>With a <span class='spanClass'>span over here</span> and a
<div class='divClass'>nested div over <div class='nestedDivClass'>there</div>
</div> and a lot of other nested <strong><em>texts</em> and tags in the
ai</strong></div>... */
Similar Questions
How to clip HTML fragments without breaking up tags
Cutting HTML strings without breaking HTML tags

Not amazing, but works.
function html_cut($text, $max_length)
{
$tags = array();
$result = "";
$is_open = false;
$grab_open = false;
$is_close = false;
$in_double_quotes = false;
$in_single_quotes = false;
$tag = "";
$i = 0;
$stripped = 0;
$stripped_text = strip_tags($text);
while ($i < strlen($text) && $stripped < strlen($stripped_text) && $stripped < $max_length)
{
$symbol = $text[$i];
$result .= $symbol;
switch ($symbol)
{
case '<':
$is_open = true;
$grab_open = true;
break;
case '"':
if ($in_double_quotes)
$in_double_quotes = false;
else
$in_double_quotes = true;
break;
case "'":
if ($in_single_quotes)
$in_single_quotes = false;
else
$in_single_quotes = true;
break;
case '/':
if ($is_open && !$in_double_quotes && !$in_single_quotes)
{
$is_close = true;
$is_open = false;
$grab_open = false;
}
break;
case ' ':
if ($is_open)
$grab_open = false;
else
$stripped++;
break;
case '>':
if ($is_open)
{
$is_open = false;
$grab_open = false;
array_push($tags, $tag);
$tag = "";
}
else if ($is_close)
{
$is_close = false;
array_pop($tags);
$tag = "";
}
break;
default:
if ($grab_open || $is_close)
$tag .= $symbol;
if (!$is_open && !$is_close)
$stripped++;
}
$i++;
}
while ($tags)
$result .= "</".array_pop($tags).">";
return $result;
}
Usage example:
$content = html_cut($content, 100);

I'm not claiming to have invented this, but there is a very complete Text::truncate() method in CakePHP which does what you want:
function truncate($text, $length = 100, $ending = '...', $exact = true, $considerHtml = false) {
if (is_array($ending)) {
extract($ending);
}
if ($considerHtml) {
if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
return $text;
}
$totalLength = mb_strlen($ending);
$openTags = array();
$truncate = '';
preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
foreach ($tags as $tag) {
if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
array_unshift($openTags, $tag[2]);
} else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
$pos = array_search($closeTag[1], $openTags);
if ($pos !== false) {
array_splice($openTags, $pos, 1);
}
}
}
$truncate .= $tag[1];
$contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
if ($contentLength + $totalLength > $length) {
$left = $length - $totalLength;
$entitiesLength = 0;
if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
foreach ($entities[0] as $entity) {
if ($entity[1] + 1 - $entitiesLength <= $left) {
$left--;
$entitiesLength += mb_strlen($entity[0]);
} else {
break;
}
}
}
$truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
break;
} else {
$truncate .= $tag[3];
$totalLength += $contentLength;
}
if ($totalLength >= $length) {
break;
}
}
} else {
if (mb_strlen($text) <= $length) {
return $text;
} else {
$truncate = mb_substr($text, 0, $length - strlen($ending));
}
}
if (!$exact) {
$spacepos = mb_strrpos($truncate, ' ');
if (isset($spacepos)) {
if ($considerHtml) {
$bits = mb_substr($truncate, $spacepos);
preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
if (!empty($droppedTags)) {
foreach ($droppedTags as $closingTag) {
if (!in_array($closingTag[1], $openTags)) {
array_unshift($openTags, $closingTag[1]);
}
}
}
}
$truncate = mb_substr($truncate, 0, $spacepos);
}
}
$truncate .= $ending;
if ($considerHtml) {
foreach ($openTags as $tag) {
$truncate .= '</'.$tag.'>';
}
}
return $truncate;
}

Use PHP's DOMDocument class to normalize an HTML fragment:
$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));
This question is similar to an earlier question and I've copied and pasted one solution here. If the HTML is submitted by users you'll also need to filter out potential Javascript attack vectors like onmouseover="do_something_evil()" or .... Tools like HTML Purifier were designed to catch and solve these problems and are far more comprehensive than any code that I could post.

I made another function to do it, it supports UTF-8:
/**
* Limit string without break html tags.
* Supports UTF8
*
* @param string $value
* @param int $limit Default 100
*/
function str_limit_html($value, $limit = 100)
{
if (mb_strwidth($value, 'UTF-8') <= $limit) {
return $value;
}
// Strip text with HTML tags, sum html len tags too.
// Is there another way to do it?
do {
$len = mb_strwidth($value, 'UTF-8');
$len_stripped = mb_strwidth(strip_tags($value), 'UTF-8');
$len_tags = $len - $len_stripped;
$value = mb_strimwidth($value, 0, $limit + $len_tags, '', 'UTF-8');
} while ($len_stripped > $limit);
// Load as HTML ignoring errors
$dom = new DOMDocument();
@$dom->loadHTML('<?xml encoding="utf-8" ?>'.$value, LIBXML_HTML_NODEFDTD);
// Fix the html errors
$value = $dom->saveHtml($dom->getElementsByTagName('body')->item(0));
// Remove body tag
$value = mb_strimwidth($value, 6, mb_strwidth($value, 'UTF-8') - 13, '', 'UTF-8'); // <body> and </body>
// Remove empty tags
return preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $value);
}
I recommend using html_entity_decode at the start of the function, so that it preserves the UTF-8 characters:
$value = html_entity_decode($value);

Use a HTML parser and stop after 100 characters of text.
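A minimal sketch of what that could look like with DOMDocument. The function name `truncateHtmlText` is made up, the input is assumed to be a UTF-8 fragment, and elements that end up empty after the cut are left in place rather than removed.

```php
<?php
// Hypothetical helper: keeps the first $limit characters of visible text
// and lets DOMDocument close whatever tags are still open.
function truncateHtmlText(string $html, int $limit): string
{
    $dom = new DOMDocument();
    // The XML declaration is a common trick to make libxml read the input as UTF-8.
    @$dom->loadHTML('<?xml encoding="utf-8" ?><div>' . $html . '</div>');
    $wrap = $dom->getElementsByTagName('div')->item(0); // our wrapper is the first div
    $xpath = new DOMXPath($dom);
    $remaining = $limit;
    // Text nodes come back in document order; copy them before editing the tree.
    foreach (iterator_to_array($xpath->query('.//text()', $wrap)) as $text) {
        $length = mb_strlen($text->nodeValue, 'UTF-8');
        if ($remaining <= 0) {
            $text->parentNode->removeChild($text); // past the limit: drop the text
        } elseif ($length > $remaining) {
            $text->nodeValue = mb_substr($text->nodeValue, 0, $remaining, 'UTF-8');
            $remaining = 0;
        } else {
            $remaining -= $length;
        }
    }
    // Serialize the wrapper's children so the wrapper itself is not returned.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo truncateHtmlText('<p>Hello <b>big world</b> again</p>', 9);
```

Since only text nodes are shortened or removed, the tags are never cut in half, and the serializer writes the closing tags.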

You should use Tidy HTML. You cut the string and then you run Tidy to close the tags.
(Credits where credits are due)

Regardless of the 100 count issues you state at the beginning, you indicate in the challenge the following:
output the character count of strip_tags (the number of characters in the actual displayed text of the HTML)
retain HTML formatting
close any unfinished HTML tag
Here is my proposal:
Basically, I parse through each character, counting as I go. I make sure NOT to count any characters inside an HTML tag. I also check at the end that I am not in the middle of a word when I stop. Once I stop, I backtrack to the first available SPACE or > as a stopping point.
$position = 0;
$length = strlen($content)-1;
// process the content putting each 100 character section into an array
while($position < $length)
{
$next_position = get_position($content, $position, 100);
$data[] = substr($content, $position, $next_position - $position);
$position = $next_position;
}
// show the array
print_r($data);
function get_position($content, $position, $chars = 100)
{
$count = 0;
// count to 100 characters skipping over all of the HTML
while($count <> $chars){
$char = substr($content, $position, 1);
if($char == '<'){
do{
$position++;
$char = substr($content, $position, 1);
} while($char !== '>');
$position++;
$char = substr($content, $position, 1);
}
$count++;
$position++;
}
echo $count."\n";
// find out where there is a logical break before 100 characters
$data = substr($content, 0, $position);
$space = strrpos($data, " ");
$tag = strrpos($data, ">");
// return the position of the logical break
if($space > $tag)
{
return $space;
} else {
return $tag;
}
}
This will also count the return codes etc. Considering they will take space, I have not removed them.

Here is a function I'm using in one of my projects. It's based on DOMDocument, works with HTML5 and is about 2x faster than other solutions I've tried (at least on my machine, 0.22 ms vs 0.43 ms using html_cut($text, $max_length) from the top answer on a 500 text-node-characters string with a limit of 400).
function cut_html ($html, $limit) {
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding("<div>{$html}</div>", "HTML-ENTITIES", "UTF-8"), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
cut_html_recursive($dom->documentElement, $limit);
return substr($dom->saveHTML($dom->documentElement), 5, -6);
}
function cut_html_recursive ($element, $limit) {
if($limit > 0) {
if($element->nodeType == 3) {
$limit -= strlen($element->nodeValue);
if($limit < 0) {
$element->nodeValue = substr($element->nodeValue, 0, strlen($element->nodeValue) + $limit);
}
}
else {
for($i = 0; $i < $element->childNodes->length; $i++) {
if($limit > 0) {
$limit = cut_html_recursive($element->childNodes->item($i), $limit);
}
else {
$element->removeChild($element->childNodes->item($i));
$i--;
}
}
}
}
return $limit;
}

Here is my attempt at the cutter. Maybe you can catch some bugs. The problem I found with the other parsers is that they don't close tags properly and they cut in the middle of a word.
function cutHTML($string, $length, $patternsReplace = false) {
$i = 0;
$count = 0;
$isParagraphCut = false;
$htmlOpen = false;
$openTag = false;
$tagsStack = array();
while ($i < strlen($string)) {
$char = substr($string, $i, 1);
if ($count >= $length) {
$isParagraphCut = true;
break;
}
if ($htmlOpen) {
if ($char === ">") {
$htmlOpen = false;
}
} else {
if ($char === "<") {
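// Read the tag name after '<' or '</' and keep the stack of open tags in sync.
if (preg_match('/\G<(\/?)(\w+)/', $string, $m, 0, $i)) {
if ($m[1] === '/') {
array_pop($tagsStack);
} elseif (!in_array(strtolower($m[2]), array('img', 'br', 'hr', 'input'))) {
$tagsStack[] = $m[2];
}
}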
$htmlOpen = true;
}
}
if (!$htmlOpen && $char != ">") {
$count++;
}
$i++;
}
if ($isParagraphCut) {
$j = $i;
while ($j > 0) {
$char = substr($string, $j, 1);
if ($char === " " || $char === ";" || $char === "." || $char === "," || $char === "<" || $char === "(" || $char === "[") {
break;
} else if ($char === ">") {
$j++;
break;
}
$j--;
}
$string = substr($string, 0, $j);
foreach(array_reverse($tagsStack) as $tag){
$tag = strtolower($tag);
if($tag !== "img" && $tag !== "br"){
$string .= "</$tag>";
}
}
$string .= "...";
}
if ($patternsReplace) {
foreach ($patternsReplace as $value) {
if (isset($value['pattern']) && isset($value["replace"])) {
$string = preg_replace($value["pattern"], $value["replace"], $string);
}
}
}
return $string;
}

try this function
// trim the string function
function trim_word($text, $length, $startPoint=0, $allowedTags=""){
$text = html_entity_decode(htmlspecialchars_decode($text));
$text = strip_tags($text, $allowedTags);
return $text = substr($text, $startPoint, $length);
}
and
echo trim_word("<h2 class='zzzz'>abcasdsdasasdas</h2>","6");

I know this is quite old, but I've recently made a small class for cutting HTML for previews: https://github.com/Simbiat/HTMLCut/
Why would you want to use that instead of the other suggestions? Here are a few things that come to my mind (taken from readme):
Preserve HTML tags, unless they are empty.
Preserve words.
Remove some orphaned punctuation signs at the end of the cut string.
Remove HTML tags, that you would not want in a preview (optional).
Limit number of paragraphs (optional).
Add an ellipsis if text was cut (optional).
Class operates with DOM, but also uses Regex in some places (mainly for cutting and trimming). Perhaps it can be of use to some.

Try the following:
<?php echo strip_tags(mb_strimwidth($VARIABLE_HERE, 0, 160, "...")); ?>
This will strip the HTML (strip_tags) and limit the text (mb_strimwidth) to 160 characters.
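One thing to consider is the order of the two calls: trimming first counts the markup towards the limit and can cut a tag in half. A small sketch that strips first and trims afterwards (the helper name `plainPreview` and its defaults are made up for illustration):

```php
<?php
// Hypothetical helper: plain-text preview of at most $width columns,
// including the trailing marker.
function plainPreview(string $html, int $width = 160, string $marker = '...'): string
{
    // Strip tags first so markup neither counts towards the limit nor gets cut in half.
    $text = trim(strip_tags($html));
    return mb_strimwidth($text, 0, $width, $marker, 'UTF-8');
}

echo plainPreview('<p>Some <strong>bold</strong> text that is far too long</p>', 20);
```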

Recursive function - tree view - <ul> <li> ... (stuck)

It seems that I'm stuck with my recursive function.
I have a problem with closing the unordered list (</ul>) and the list items (</li>).
What I get is:
-aaa
-bbb
-b11
-b22
-b33
-ccc
-c11
-c22
-c33
-ddd
-d11
-d22
-d33
-eee
-fff
And what I want is:
-aaa
-bbb
-b11
-b22
-b2a
-b2c
-b2b
-b33
-ccc
-c11
-c22
-c33
-c2a
-c2c
-c2c1
-c2c2
-c2b
-ddd
-d11
-d22
-d33
-eee
-fff
This is the code that i'm using
$html .= '<ul>';
$i = 0;
foreach ($result as $item)
{
$html .= "<li>$item->id";
$html .= getSubjects($item->id, NULL, "",$i); // <--- start
$html .= "</li>";
}
$html .= '</ul>';
And the function
function getSubjects($chapter_id = NULL, $subject_id = NULL, $string = '', $i = 0 ) {
$i++;
// getting the information out of the database
// Depending of his parent was a chapter or a subject
$query = db_select('course_subject', 'su');
//JOIN node with users
$query->join('course_general_info', 'g', 'su.general_info_id = g.id');
// If his parent was a chapter - get all the values where chapter id = ...
if ($chapter_id != NULL) {
$query
->fields('g', array('short_title', 'general_id'))
->fields('su', array('id'))
->condition('su.chapter_id', $chapter_id, '=');
$result = $query->execute();
}
// if the parent is a subject -
// get value all the values where subject id = ...
else {
$query
->fields('g', array('short_title', 'general_id'))
->fields('su', array('id'))
->condition('su.subject_id', $subject_id, '=');
$result = $query->execute();
}
// Because count doesn't work (drupal)
$int = 0;
foreach ($result as $t) {
$int++;
}
// if there no values in result - than return the string
if ($int == 0) {
return $string;
}
else {
// Creating a new <ul>
$string .= "<ul>";
foreach ($result as $item) {
// change the id's
$subject_id = $item->id;
$chapter_id = NULL;
// and set the string --> with the function to his own function
$string .= "<li>$item->short_title - id - $item->id ";
getSubjects(NULL, $subject_id, $string, $i);
$string .="</li>";
}
$string .= "</ul>";
}
// I thought that this return wasn't necessary
return $string;
}
Does someone have more experience with this kind of things?
All help is welcome.

I am not sure what you are trying to do but here is some code you can test and see if it helps to solve your problem:
This part is just for testing, it makes three dimensional array for testing:
for ($x = 0; $x < 2; $x++) {
$result["c$x"] = "ROOT-{$x}";
for ($y = 0; $y < 3; $y++) {
$result[$x]["c$y"] = "SECOND-{$x}-{$y}";
$rnd_count1 = rand(0,3);
for ($z = 0; $z < $rnd_count1; $z++) {
$result[$x][$y]["c$z"] = "RND-{$x}-{$y}-{$z}";
$rnd_count2 = rand(0,4);
for ($c = 0; $c < $rnd_count2; $c++) {
$result[$x][$y][$z][$c] = "LAST-{$x}-{$y}-{$z}-{$c}";
}
}
}
}
// $result is now four dimensional array with some values
// Last two levels gets random count starting from 0 items.
UPDATE:
Added some randomness and fourth level to test array.
And here is function which sorts array to unordered list:
function recursive(array $array, $list_open = false){
$html = '';
foreach ($array as $item) {
if (is_array($item)) {
$html .= "<ul>\n";
$html .= recursive($item, true);
$html .= "</ul>\n";
$list_open = false;
} else {
if (!$list_open) {
$html .= "<ul>\n";
$list_open = true;
}
$html .= "\t<li>$item</li>\n";
}
}
if ($list_open) $html .= "</ul>\n";
return $html;
}
// Then test run, output results to page:
echo recursive($result);
UPDATE:
Now it should open and close <ul> tags properly.
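For the structure in the question (items that each may have children), a sketch along these lines may be closer to what is needed. The array shape with 'title' and 'children' keys and the name `renderTree` are assumptions; in the question the children would come from the database query instead. Note that the recursive call in the question's getSubjects discards its return value, whereas here it is appended to the string.

```php
<?php
// Hypothetical data shape: each node is ['title' => string, 'children' => array of nodes].
function renderTree(array $nodes): string
{
    if (!$nodes) {
        return ''; // no children: emit nothing, so no empty <ul> appears
    }
    $html = '<ul>';
    foreach ($nodes as $node) {
        $html .= '<li>' . htmlspecialchars($node['title']);
        // The recursive call returns the nested list; it must be appended here.
        $html .= renderTree($node['children'] ?? []);
        $html .= '</li>';
    }
    return $html . '</ul>';
}

$tree = [
    ['title' => 'aaa'],
    ['title' => 'bbb', 'children' => [
        ['title' => 'b11'],
        ['title' => 'b22', 'children' => [['title' => 'b2a']]],
    ]],
];
echo renderTree($tree);
// prints: <ul><li>aaa</li><li>bbb<ul><li>b11</li><li>b22<ul><li>b2a</li></ul></li></ul></li></ul>
```

Each nested `<ul>` sits inside its parent `<li>`, so every list and list item is closed at the level where it was opened.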
