I am trying to parse screen-scraped data using Zend_Dom_Query, but I am struggling how to apply it properly for my case, and all other answers I have seen on SO make assumptions that quite frankly scare me with their naiveté.
A typical example is How to Pass Array from Zend Dom Query Results to table where pairs of data points are being extracted from the documents body through the use of separate calls to the query() method.
$year = $dom->query('.secondaryInfo');
$rating = $dom->query('.ratingColumn');
Where the underlying assumptions are that an equal number of $year and $rating results exist AND that they are correctly aligned with each other within the document. If either of those assumptions are wrong, then the extracted data is less than worthless - in fact it becomes all lies.
In my case I am trying to extract multiple chunks of data from a site, where each chunk is nominally of the form:
<p class="main" atrb1="value1">
<a href="#1" >href text 1</a>
<span class="sub1">
<span class="span1"></span>
<span class="sub2">
<span class="span2">data span2</span>
href text 2
</span>
<span class="sub3">
<span class="span3">
<p>Some other data</p>
<span class="sub4">
<span class="sub5">More data</span>
</span>
</span>
</span>
</span>
</p>
For each chunk, I need to grab data from various sections:
".main"
".main a"
".main .span2"
".main .sub2 a"
".main .span3 p"
etc
And then process the set of data as one distinct unit, and not as multiple collections of different data.
I know I can hard code the selection of each element (and I currently do that), but that produces brittle code reliant on the source data being stable. And this week the data source yet again changed and I was bitten by my hard coded scraping failing to work. Thus I am trying to write robust code that can locate what I want without me having to care/know about the overall structure (Hmmm - Linq for php?)
So in my mind, I want the code to look something like
$dom = new Zend_Dom_Query($body);
$results = $dom->query('.main');
foreach ($results as $result)
{
$data1 = $result->query(".main a");
$data2 = $result->query(".main .span2");
$data3 = $result->query(".main .sub a");
etc
if ($data1 && $data2 && $data3) {
Do something
} else {
Do something else
}
}
Is it possible to do what I want with stock Zend/PHP function calls? Or do I need to write some sort of custom function to implement $result->query()?
OK .. so I bit the bullet and wrote my own solution to the problem. This code recurses through the results from the Zend_Dom_Query and looks for matching css selectors. As presented the code works for me and has also helped clean up my code. Performance wasn't an issue for me, but as always Caveat Emptor. I have also left in some commented out code that enables visualization of where the search is leading. The code was also part of a class, hence the use of $this-> in places.
The code is used as:
$dom = new Zend_Dom_Query($body);
$results = $dom->query('.main');
foreach ($results as $result)
{
$data1 = $this->domQuery($result, ".sub2 a");
if (!is_null($data1))
{
Do Something
}
}
Which finds the href text 2 element under the <span class="sub2"> element.
// Function that recurses through a Zend_Dom_Query_Result, looking for css selectors
private function recurseDomQueryResult($dom, $depth, $targets, $index, $count)
{
// Gross checking
if ($index<0 || $index >= $count) return NULL;
// Document where we are
$element = $dom->nodeName;
$class = NULL;
$id = NULL;
// $href = NULL;
// Skip unwanted elements
if ($element == '#text') return NULL;
if ($dom->hasAttributes()) {
if ($dom->hasAttribute('class'))
{
$class = trim($dom->getAttribute('class'));
}
if ($dom->hasAttribute('id'))
{
$id = trim($dom->getAttribute('id'));
}
// if ($element == 'a')
// {
// if ($dom->hasAttribute('href'))
// {
// $href = trim($dom->getAttribute('href'));
// }
// }
}
// $padding = str_repeat('==', $depth);
// echo "$padding<$element";
// if (!($class === NULL)) echo ' class="'.$class.'"';
// if (!($href === NULL)) echo ' href="'.$href.'"';
// echo '><br />'. "\n";
// See if we have a match for the current element
$target = $targets[$index];
$sliced = substr($target,1);
switch($target[0])
{
case '.':
if ($sliced === $class) {
$index++;
}
break;
case '#':
if ($sliced === $id) {
$index++;
}
break;
default:
if ($target === $element) {
$index++;
}
break;
}
// Check for having matched all
if ($index == $count) return $dom;
// We didn't have a match at this level
// So recursively look at all the children
$children = $dom->childNodes;
if ($children) {
foreach($children as $child)
{
if (!is_null(($result = $this->recurseDomQueryResult($child, $depth+1, $targets, $index, $count)))) return $result;
}
}
// Did not find anything
// echo "$padding</$element><br />\n";
return NULL;
}
// User function that you call to find a single element in a Zend_Dom_Query_Result
// $dom is the Zend_Dom_Query_Result object
// $path is a path of css selectors, e.g. ".sub2 a"
private function domQuery($dom, $path)
{
$depth = 0;
$index = 0;
$targets = explode(' ', $path);
$count = count($targets);
return $this->recurseDomQueryResult($dom, $depth, $targets, $index, $count);
}
Related
Now I am trying to write a PhP parser and I don't why my code return an empty array. I am using PHP Simple HTML DOM. I know my code is't perfect, but it's only for testing.
I will be appreciate for any help
public function getData() {
// get url form urls.txt
foreach ($this->list_url as $i => $url) {
// create a DOM object from a HTML file
$this->html = file_get_html($url);
// find array all elements with class="name" because every products having name
$products = $this->html->find(".name");
foreach ($products as $number => $product) {
// get value attr a=href product
$href = $this->html->find("div.name a", $number)->attr['href'];
// create a DOM object form a HTML file
$html = file_get_html($href);
if($html && is_object($html) && isset($html->nodes)){
echo "TRUE - all goodly";
} else{
echo "FALSE - all badly";
}
// get all elements class="description"
// $nodes is empty WHY? Tough web-page having content and div.description?
$nodes = $html->find('.description');
if (count($nodes) > 0) {
$needle = "Производитель:";
foreach ($nodes as $short_description) {
if (stripos($short_description->plaintext, $needle) !== FALSE) {
echo "TRUE";
$this->data[] = $short_description->plaintext;
} else {
echo "FALSE";
}
}
} else {
$this->data[] = '';
}
$html->clear();
unset($html);
}
$this->html->clear();
unset($html);
}
return $this->data;
}
hi you should inspect the element and copy->copy selector and use it in find method to getting the object
I am using the following code to place some ad code inside my content .
<?php
$content = apply_filters('the_content', $post->post_content);
$content = explode (' ', $content);
$halfway_mark = ceil(count($content) / 2);
$first_half_content = implode(' ', array_slice($content, 0, $halfway_mark));
$second_half_content = implode(' ', array_slice($content, $halfway_mark));
echo $first_half_content.'...';
echo ' YOUR ADS CODE';
echo $second_half_content;
?>
How can i modify this so that the 2 paragraphs (top and bottom) enclosing the ad code should not be the one having images. If the top or bottom paragraph has image then try for next 2 paragraphs.
Example: Correct Implementation on the right.
preg_replace version
This code steps through every paragraph ignoring those that contain image tags. The $pcount variable is incremented for every paragraph found without an image, if an image is encountered however, $pcount is reset to zero. Once $pcount reaches the point where it would hit two, the advert markup is inserted just before that paragraph. This should leave the advert markup between two safe paragraphs. The advert markup variable is then nullified so only one advert is inserted.
The following code is just for set up and could be modified to split the content differently, you could also modify the regular expression that is used — just in case you are using double BRs or something else to delimit your paragraphs.
/// set our advert content
$advert = '<marquee>BUY THIS STUFF!!</marquee>' . "\n\n";
/// calculate mid point
$mpoint = floor(strlen($content) / 2);
/// modify back to the start of a paragraph
$mpoint = strripos($content, '<p', -$mpoint);
/// split html so we only work on second half
$first = substr($content, 0, $mpoint);
$second = substr($content, $mpoint);
$pcount = 0;
$regexp = '/<p>.+?<\/p>/si';
The rest is the bulk of the code that runs the replacement. This could be modified to insert more than one advert, or to support more involved image checking.
$content = $first . preg_replace_callback($regexp, function($matches){
global $pcount, $advert;
if ( !$advert ) {
$return = $matches[0];
}
else if ( stripos($matches[0], '<img ') !== FALSE ) {
$return = $matches[0];
$pcount = 0;
}
else if ( $pcount === 1 ) {
$return = $advert . $matches[0];
$advert = '';
}
else {
$return = $matches[0];
$pcount++;
}
return $return;
}, $second);
After this code has been executed the $content variable will contain the enhanced HTML.
PHP versions prior to 5.3
As your chosen testing area does not support PHP 5.3, and so does not support anonymous functions, you need to use a slightly modified and less succinct version; that makes use of a named function instead.
Also, in order to support content that may not actually leave space for the advert in it's second half I have modified the $mpoint so that it is calculated to be 80% from the end. This will have the effect of including more in the $second part — but will also mean your adverts will be generally placed higher up in the mark-up. This code has not had any fallback implemented into it, because your question does not mention what should happen in the event of failure.
$advert = '<marquee>BUY THIS STUFF!!</marquee>' . "\n\n";
$mpoint = floor(strlen($content) * 0.8);
$mpoint = strripos($content, '<p', -$mpoint);
$first = substr($content, 0, $mpoint);
$second = substr($content, $mpoint);
$pcount = 0;
$regexp = '/<p>.+?<\/p>/si';
function replacement_callback($matches){
global $pcount, $advert;
if ( !$advert ) {
$return = $matches[0];
}
else if ( stripos($matches[0], '<img ') !== FALSE ) {
$return = $matches[0];
$pcount = 0;
}
else if ( $pcount === 1 ) {
$return = $advert . $matches[0];
$advert = '';
}
else {
$return = $matches[0];
$pcount++;
}
return $return;
}
echo $first . preg_replace_callback($regexp, 'replacement_callback', $second);
You could try this:
<?php
$ad_code = 'SOME SCRIPT HERE';
// Your code.
$content = apply_filters('the_content', $post->post_content);
// Split the content at the <p> tags.
$content = explode ('<p>', $content);
// Find the mid of the article.
$content_length = count($content);
$content_mid = floor($content_length / 2);
// Save no image p's index.
$last_no_image_p_index = NULL;
// Loop beginning from the mid of the article to search for images.
for ($i = $content_mid; $i < $content_length; $i++) {
// If we do not find an image, let it go down.
if (stripos($content[$i], '<img') === FALSE) {
// In case we already have a last no image p, we check
// if it was the one right before this one, so we have
// two p tags with no images in there.
if ($last_no_image_p_index === ($i - 1)) {
// We break here.
break;
}
else {
$last_no_image_p_index = $i;
}
}
}
// If no none image p tag was found, we use the last one.
if (is_null($last_no_image_p_index)) {
$last_no_image_p_index = ($content_length - 1);
}
// Add ad code here with trailing <p>, so the implode later will work correctly.
$content = array_slice($content, $last_no_image_p_index, 0, $ad_code . '</p>');
$content = implode('<p>', $content);
?>
It will try to find a place for the ad from the mid of your article and if none is found the ad is put to the end.
Regards
func0der
I think this will work:
First explode the paragraphs, then you have to loop it and check if you find img inside them.
If you find it inside, you try the next.
Think of this as psuedo-code, since it's not tested. You will have to make a loop too, comments in the code :) Sorry if it contains bugs, it's written in Notepad.
<?php
$i = 0; // counter
$arrBoolImg = array(); // array for the paragraph booleans
$content = apply_filters('the_content', $post->post_content);
$contents = str_replace ('<p>', '<explode><p>', $content); // here we add a custom tag, so we can explode
$contents = explode ('<explode>', $contents); // then explode it, so we can iterate the paragraphs
// fill array with boolean array returned
$arrBoolImg = hasImages($contents);
$halfway_mark = ceil(count($contents) / 2);
/*
TODO (by you):
---
When you have $arrBoolImg filled, you can itarate through it.
You then simply loop from the middle of the array $contents (plural), that is exploded from above.
The startingpoing for your loop is the middle, the upper bounds is the +2 or what ever :-)
Then you simply insert your magic.. And then glue it back together, as you did before.
I think this will work. even though the code may have some bugs, since I wrote it in Notepad.
*/
function hasImages($contents) {
/*
This function loops through the $contents array and checks if they have images in them
The return value, is an array with boolean values, so one can iterate through it.
*/
$arrRet = array(); // array for the paragraph booleans
if (count($content)>=1) {
foreach ($contents as $v) { // iterate the content
if (strpos($v, '<img') === false) { // did not find img
$arrRet[$i] = false;
}
else { // found img
$arrRet[$i] = true;
}
$i++;
} // end for each loop
return $arrRet;
} // end if count
} // end hasImages func
?>
[This is just an idea, I don't have enough reputation to comment...]
After calling #Olavxxx's method and filling your boolean array you could just loop through that array in an alternating manner starting in the middle: Let's assume your array is 8 entries long. Calculating the middle using your method you get 4. So you check the combination of values 4 + 3, if that doesn't work, you check 4 + 5, after that 3 + 2, ...
So your loop looks somewhat like
$middle = ceil(count($content) / 2);
$i = 1;
while ($i <= $middle) {
$j = $middle + (-1) ^ $i * $i;
$k = $j + 1;
if (!$hasImagesArray[$j] && !$hasImagesArray[$k])
break; // position found
$i++;
}
It's up to you to implement further constraints to make sure the add is not shown to far up or down in the article...
Please note that you need to take care of special cases like too short arrays too in order to prevent IndexOutOfBounds-Exceptions.
I want to paginate the following filtered results from xml file:
<?php
//load up the XML file as the variable $xml (which is now an array)
$xml = simplexml_load_file('inventory.xml');
//create the function xmlfilter, and tell it which arguments it will be handling
function xmlfilter ($xml, $color, $weight, $maxprice)
{
$res = array();
foreach ($xml->widget as $w)
{
//initially keep all elements in the array by setting keep to 1
$keep = 1;
//now start checking to see if these variables have been set
if ($color!='')
{
//if color has been set, and the element's color does not match, don't keep this element
if ((string)$w->color != $color) $keep = 0;
}
//if the max weight has been set, ensure the elements weight is less, or don't keep this element
if ($weight)
{
if ((int)$w->weight > $weight) $keep = 0;
}
//same goes for max price
if ($maxprice)
{
if ((int)$w->price > $maxprice) $keep = 0;
}
if ($keep) $res[] = $w;
}
return $res;
}
//check to see if the form was submitted (the url will have '?sub=Submit' at the end)
if (isset($_GET['sub']))
{
//$color will equal whatever value was chosen in the form (url will show '?color=Blue')
$color = isset($_GET['color'])? $_GET['color'] : '';
//same goes for these fellas
$weight = $_GET['weight'];
$price = $_GET['price'];
//now pass all the variables through the filter and create a new array called $filtered
$filtered = xmlfilter($xml ,$color, $weight, $price);
//finally, echo out each element from $filtered, along with its properties in neat little spans
foreach ($filtered as $widget) {
echo "<div class='widget'>";
echo "<span class='name'>" . $widget->name . "</span>";
echo "<span class='color'>" . $widget->color . "</span>";
echo "<span class='weight'>" . $widget->weight . "</span>";
echo "<span class='price'>" . $widget->price . "</span>";
echo "</div>";
}
}
Where $xml->widget represents the following xml:
<hotels xmlns="">
<hotels>
<hotel>
<noofrooms>10</noofrooms>
<website></website>
<imageref>oias-sunset-2.jpg|villas-agios-nikolaos-1.jpg|villas-agios-nikolaos-24.jpg|villas-agios-nikolaos-41.jpg</imageref>
<descr>blah blah blah</descr>
<hotelid>119</hotelid>
</hotel>
</hotels>
</hotels>
Any good ideas?
Honestly if you're already using XML and want to do Pagination then use XSL. It'll allow for formatting of the results and for pagination with ease. PHP has a built in XSL transformer iirc
See http://www.codeproject.com/Articles/11277/Pagination-using-XSL for a decent example.
Is it possible to use a foreach loop to scrape multiple URL's from an array? I've been trying but for some reason it will only pull from the first URL in the array and the show the results.
include_once('../../simple_html_dom.php');
$link = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($link as $links) {
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value; }
// get title
$ret['ASIN'] = end($values);
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] =$html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB($links);
foreach($ret as $k=>$v)
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
Here is the code since the comment part didn't work. :) It's very dirty because I just edited one of the examples to play with it to see if I could get it to do what I wanted.
include_once('../../simple_html_dom.php');
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
// What is this spaghetti code good for?
/*
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value;
}
// get title
$ret['ASIN'] = end($values);
*/
foreach($html->find('input') as $element) {
if($element->id == 'ASIN') {
$ret['ASIN'] = $element->value;
}
}
// Our you could use the following instead of the whole foreach loop above
//
// $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
//
// if the 0 means, return first found or something similar,
// I just had a look at Amazons source code, and it contains
// 2 HTML tags with id='ASIN'. If they were following html-regulations
// then there should only be ONE element with a specific id.
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
This should do the trick
I have renamed the array to 'links' instead of 'link'. It's an array of links, containing link(s), therefore, foreach($link as $links) seemed wrong, and I changed it to foreach($links as $link)
I really need to ask this question as it will answer way more questions after the world reads this thread. What if ... you used articles like the simple html dom site.
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
return $ret;
}
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
what if its $articles?
$articles[] = $item;
}
//print_r($articles);
$links = array (
'http://link1.com',
'http://link2.com',
'http://link3.com'
);
what would this area look like?
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
Ive seen this multiple links all over stackoverflow for past 2 years, and I still cannot figure it out. Would be great to get the basic handle on it to how the simple html dom examples are.
thx.
First time postin im sure I broke a bunch of rules and didnt do the code section right. I just had to ask this question badly.
I have a huge HTML code in a PHP variable like :
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
I want to display only first 500 characters of this code. This character count must consider the text in HTML tags and should exclude HTMl tags and attributes while measuring the length.
but while triming the code, it should not affect DOM structure of HTML code.
Is there any tuorial or working examples available?
If its the text you want, you can do this with the following too
substr(strip_tags($html_code),0,500);
Ooohh... I know this I can't get it exactly off the top of my head but you want to load the text you've got as a DOMDOCUMENT
http://www.php.net/manual/en/class.domdocument.php
then grab the text from the entire document node (as a DOMnode http://www.php.net/manual/en/class.domnode.php)
This won't be exactly right, but hopefully this will steer you onto the right track.
Try something like:
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$dom = new DOMDocument();
$dom->loadHTML($html_code);
$text_to_strip = $dom->textContent;
$stripped = mb_substr($text_to_strip,0,500);
echo "$stripped"; // The Sameple text.Another sample text.....
edit ok... that should work. just tested locally
edit2
Now that I understand you want to keep the tags, but limit the text, lets see. You're going to want to loop the content until you get to 500 characters. This is probably going to take a few edits and passes for me to get right, but hopefully I can help. (sorry I can't give undivided attention)
First case is when the text is less than 500 characters. Nothing to worry about. Starting with the above code we can do the following.
if (strlen($stripped) > 500) {
// this is where we do our work.
$characters_so_far = 0;
foreach ($dom->child_nodes as $ChildNode) {
// should check if $ChildNode->hasChildNodes();
// probably put some of this stuff into a function
$characters_in_next_node += str_len($ChildNode->textcontent);
if ($characters_so_far+$characters_in_next_node > 500) {
// remove the node
// try using
// $ChildNode->parentNode->removeChild($ChildNode);
}
$characters_so_far += $characters_in_next_node
}
//
$final_out = $dom->saveHTML();
} else {
$final_out = $html_code;
}
i'm pasting below a php class i wrote a long time ago, but i know it works. its not exactly what you're after, as it deals with words instead of a character count, but i figure its pretty close and someone might find it useful.
class HtmlWordManipulator
{
var $stack = array();
function truncate($text, $num=50)
{
if (preg_match_all('/\s+/', $text, $junk) <= $num) return $text;
$text = preg_replace_callback('/(<\/?[^>]+\s+[^>]*>)/','_truncateProtect', $text);
$words = 0;
$out = array();
$text = str_replace('<',' <',str_replace('>','> ',$text));
$toks = preg_split('/\s+/', $text);
foreach ($toks as $tok)
{
if (preg_match_all('/<(\/?[^\x01>]+)([^>]*)>/',$tok,$matches,PREG_SET_ORDER))
foreach ($matches as $tag) $this->_recordTag($tag[1], $tag[2]);
$out[] = trim($tok);
if (! preg_match('/^(<[^>]+>)+$/', $tok))
{
if (!strpos($tok,'=') && !strpos($tok,'<') && strlen(trim(strip_tags($tok))) > 0)
{
++$words;
}
else
{
/*
echo '<hr />';
echo htmlentities('failed: '.$tok).'<br /)>';
echo htmlentities('has equals: '.strpos($tok,'=')).'<br />';
echo htmlentities('has greater than: '.strpos($tok,'<')).'<br />';
echo htmlentities('strip tags: '.strip_tags($tok)).'<br />';
echo str_word_count($text);
*/
}
}
if ($words > $num) break;
}
$truncate = $this->_truncateRestore(implode(' ', $out));
return $truncate;
}
function restoreTags($text)
{
foreach ($this->stack as $tag) $text .= "</$tag>";
return $text;
}
private function _truncateProtect($match)
{
return preg_replace('/\s/', "\x01", $match[0]);
}
private function _truncateRestore($strings)
{
return preg_replace('/\x01/', ' ', $strings);
}
private function _recordTag($tag, $args)
{
// XHTML
if (strlen($args) and $args[strlen($args) - 1] == '/') return;
else if ($tag[0] == '/')
{
$tag = substr($tag, 1);
for ($i=count($this->stack) -1; $i >= 0; $i--) {
if ($this->stack[$i] == $tag) {
array_splice($this->stack, $i, 1);
return;
}
}
return;
}
else if (in_array($tag, array('p', 'li', 'ul', 'ol', 'div', 'span', 'a')))
$this->stack[] = $tag;
else return;
}
}
truncate is what you want, and you pass it the html and the number of words you want it trimmed down to. it ignores html while counting words, but then rewraps everything in html, even closing trailing tags due to the truncation.
please don't judge me on the complete lack of oop principles. i was young and stupid.
edit:
so it turns out the usage is more like this:
$content = $manipulator->restoreTags($manipulator->truncate($myHtml,$numOfWords));
stupid design decision. allowed me to inject html inside the unclosed tags though.
I'm not up to coding a real solution, but if someone wants to, here's what I'd do (in pseudo-PHP):
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$aggregate = '';
$document = XMLParser($html_code);
foreach ($document->getElementsByTagName('*') as $element) {
$aggregate .= $element->text(); // This is the text, not HTML. It doesn't
// include the children, only the text
// directly in the tag.
}