Php Simple HTML DOM Parser - how to work with block repeats

I have <h2> blocks without attributes, each followed by <p> blocks that also have no attributes.
The structure looks like this:
<h2></h2>
<p></p>
<p></p>
<p></p>
<h2></h2>
<p></p>
<p></p>
<h2></h2>
<p></p>
I'm using PHP Simple HTML DOM Parser. I want to get the data from each <h2> block, then all the <p> blocks up to the next <h2>, and so on.
Each <h2> must stay connected to the <p> blocks that follow it. I thought of using key => value pairs (for example <h2> => <p>, <p>, ... and then the next <h2>), but I am not sure how to do this.
I also know about next_sibling(), but I don't know how to use it in a loop. I made two variables: the first holds all the <h2> elements, the second holds the <p> elements. I thought that might be useful for my goal. Here is the code:
$test = file_get_html('url');
foreach($test->find('h2') as $test2) {
echo $test2 . '<br>';
foreach($test->find('p') as $test3) {
echo $test3 .'<br>';
}
}

It's not super clear what you're looking for but here's an idea to get you started:
foreach($html->find('h2') as $el){
$h2 = $el;
while($el = $el->next_sibling()){
if('p' != $el->tag) break;
// do something
}
}
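
The key => value structure the question asks about can be built with the same sibling walk. Below is a minimal sketch that swaps in PHP's built-in DOMDocument for Simple HTML DOM so that it runs without extra libraries. The sample markup and the function name groupParagraphsByHeading are made up for illustration, and headings with identical text would overwrite each other in this version.

```php
<?php
// Hypothetical helper: returns [h2 text => [p text, ...]] for each <h2>.
function groupParagraphsByHeading(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings quiet
    $doc->loadHTML($html);

    $groups = [];
    foreach ($doc->getElementsByTagName('h2') as $h2) {
        $key = trim($h2->textContent);
        $groups[$key] = [];
        // Walk the following siblings until something other than <p> appears.
        for ($el = $h2->nextSibling; $el !== null; $el = $el->nextSibling) {
            if ($el->nodeType !== XML_ELEMENT_NODE) {
                continue;               // skip text nodes and comments
            }
            if ($el->nodeName !== 'p') {
                break;                  // the next <h2> (or any other tag) ends the group
            }
            $groups[$key][] = trim($el->textContent);
        }
    }
    return $groups;
}

$sample = '<h2>First</h2><p>a</p><p>b</p><h2>Second</h2><p>c</p>';
print_r(groupParagraphsByHeading($sample));
```

With Simple HTML DOM itself, the same idea should carry over by using next_sibling() and ->tag as in the loop above.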

Here is the answer to my own question. I hope it helps somebody!
foreach ($html->find('.div') as $div)
{
// only handle blocks that are immediately followed by an <h2>
$next = $div->next_sibling();
if (!$next || $next->tag !== 'h2') continue;
$h2 = $next;
echo $h2;
// print the <p>, <table> and <ul> siblings that follow the <h2>,
// stopping at the first sibling with any other tag
$el = $h2;
while ($el = $el->next_sibling())
{
if (!in_array($el->tag, array('p', 'table', 'ul'), true)) break;
echo $el;
}
}

Related

How would I modify a HTML string without touching the HTML elements?

Suppose I have this string:
$test = '<p>You are such a <strong class="Stack">helpful</strong> Stack Exchange user.</p>';
And then I naively replace any instance of "Stack" with "Flack", I will get this:
$test = '<p>You are such a <strong class="Flack">helpful</strong> Flack Exchange user.</p>';
Clearly, I did not want this. I only wanted to change the actual "content" -- not the HTML parts. I want this:
$test = '<p>You are such a <strong class="Stack">helpful</strong> Flack Exchange user.</p>';
For that to be possible, there has to be some kind of intelligent parsing going on. Something which first detects and picks out the HTML elements from the string, then makes the string replacement operation on the "pure" content string, and then somehow puts the HTML elements back, intact, in the right places.
My brain has been wrestling with this for quite some time now and I can't find any reasonable solution which wouldn't be hackish and error-prone.
It strikes me that this might exist as a feature built into PHP. Is that the case? Or is there some way I could accomplish this in a robust and sane way?
I would rather not try to replace all HTML parts with ____DO_NOT_TOUCH_1____, ____DO_NOT_TOUCH_2____, etc. It doesn't seem like the right way.

You can do it as suggested by @04FS, with the following recursive function:
function replaceText(DOMNode $node, string $search, string $replace) {
if($node->hasChildNodes()) {
foreach($node->childNodes as $child) {
if ($child->nodeType == XML_TEXT_NODE) {
$child->textContent = str_replace($search, $replace, $child->textContent);
} else {
replaceText($child, $search, $replace);
}
}
}
}
As DOMDocument is a DOMNode, too, you can use it directly as a function argument:
$html =
'<div class="foo">
<span class="foo">foo</span>
<span class="foo">foo</span>
foo
</div>';
$doc = new DOMDocument();
$doc->loadXML($html); // alternatively loadHTML(), will throw an error on invalid HTML tags
replaceText($doc, 'foo', 'bar');
echo $doc->saveXML();
// or
echo $doc->saveXML($doc->firstChild);
// ... to get rid of the leading XML version tag
Will output
<div class="foo">
<span class="foo">bar</span>
<span class="foo">bar</span>
bar
</div>
Bonus: When you want to str_replace an attribute value
function replaceTextInAttribute(DOMNode $node, string $attribute_name, string $search, string $replace) {
if ($node->hasAttributes()) {
foreach ($node->attributes as $attr) {
if($attr->nodeName === $attribute_name) {
$attr->nodeValue = str_replace($search, $replace, $attr->nodeValue);
}
}
}
if($node->hasChildNodes()) {
foreach($node->childNodes as $child) {
replaceTextInAttribute($child, $attribute_name, $search, $replace);
}
}
}
Bonus 2: Make the function more extensible
function modifyText(DOMNode $node, callable $userFunc) {
if($node->hasChildNodes()) {
foreach($node->childNodes as $child) {
if ($child->nodeType == XML_TEXT_NODE) {
$child->textContent = $userFunc($child->textContent);
} else {
modifyText($child, $userFunc);
}
}
}
}
modifyText(
$doc,
function(string $string) {
return strtoupper(str_replace('foo', 'bar', $string));
}
);
echo $doc->saveXML($doc->firstChild);
Will output
<div class="foo">
<span class="foo">BAR</span>
<span class="foo">BAR</span>
BAR
</div>
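
Applied to the string from the question, the same text-node walk leaves the class attribute alone and only changes the visible text. This is a minimal sketch: replaceInTextNodes is an illustrative name for a variant of the function above, and it assumes the input is well-formed enough for loadXML().

```php
<?php
// Illustrative variant of the recursive walk: only text nodes are rewritten.
function replaceInTextNodes(DOMNode $node, string $search, string $replace): void
{
    if (!$node->hasChildNodes()) {
        return;
    }
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $child->nodeValue = str_replace($search, $replace, $child->nodeValue);
        } else {
            replaceInTextNodes($child, $search, $replace);
        }
    }
}

$test = '<p>You are such a <strong class="Stack">helpful</strong> Stack Exchange user.</p>';

$doc = new DOMDocument();
$doc->loadXML($test);
replaceInTextNodes($doc, 'Stack', 'Flack');

// Serialise only the root element to skip the XML declaration.
$result = $doc->saveXML($doc->documentElement);
echo $result;
// <p>You are such a <strong class="Stack">helpful</strong> Flack Exchange user.</p>
```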

'echo' a tag without a hyphen and Proper Case using str_replace?

I have a bit of a tricky situation here.
I am pulling bookmarks from a service called Pinboard using their API, which works great, but the bookmark's 'tag' (i.e. its category) is echoed out in full, with hyphens.
The difficulty is that I'd like $tag to keep the hyphens in one place (so that it works as an anchor link in the markup) while the same $tag is echoed without them in the <h1>.
For example, one of the tags is 'Latest-News', and I'd like it printed like this:
for the anchor tag: $tag will echo 'latest-news'
for the <h1> tag: $tag will echo 'Latest News'
Any ideas how this is done?
Something like this might be on the right track (I hope! I'm still clearly a n00b):
$str = str_replace("-", " ", $tag);
echo $tag;
+++++
include 'pinboard-api.php';
$pinboard = new PinboardAPI('myusername', 'xxxxxxx');
$bookmarks_all = $pinboard->get_all();
$bookmarks_grouped_by_tags = array();
foreach ($bookmarks_all as $bookmark) {
if (! empty($bookmark->tags) && is_array($bookmark->tags)) {
foreach ($bookmark->tags as $tag) {
$bookmarks_grouped_by_tags[$tag][] = $bookmark;
}
} else {
$bookmarks_grouped_by_tags['no_tag'][] = $bookmark;
}
}
?>
<?php foreach ($bookmarks_grouped_by_tags as $tag => $bookmarks) { ?>
<a name="<?php echo $tag ?>">
***** <h1><?php echo $tag ?></a></h1> *******
<? foreach ($bookmarks as $bookmark) { ?>
<div>
<?php echo $bookmark->title ?>
</div>
<div><?php echo $bookmark->description ?></div>
<?php } ?>
<?php } ?>
++++
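
One way to get both forms is to leave $tag untouched and derive two new values from it. The attempt above stores the replaced string in $str but then echoes $tag, which is why the hyphens remain. A minimal sketch, assuming the tags look like 'Latest-News'; $anchor and $heading are illustrative names:

```php
<?php
$tag = 'Latest-News';

// Anchor name: keep the hyphens, lower-case the letters.
$anchor = strtolower($tag);

// Heading text: swap hyphens for spaces and capitalise the first letter of each word.
$heading = ucwords(str_replace('-', ' ', $tag));

echo '<a name="' . htmlspecialchars($anchor) . '"></a>';
echo '<h1>' . htmlspecialchars($heading) . '</h1>';
```

Inside the foreach loop of the template, the same two expressions could be computed from the loop's $tag before the markup is printed.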

Simple HTML DOM Not Finding DIV

I have code that tries to extract the event SKU from the Robot Events page; here is an example. The code I am using doesn't find any SKU on the page. The SKU is on line 411 of the page source, in a div with the class "product-sku". My code doesn't even find the div on the page and just downloads all the events. Here is my code:
<?php
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = file_get_html($event[4]);
$html->load($htmldown);
echo "Downloaded";
foreach ($html->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
?>
Can anyone help me fix my code?

This code uses PHP's DOMDocument class. It works for the sample HTML below; please try it.
// new DOM object
$dom = new DOMDocument();
// HTML string
$html_string = '<html>
<body>
<div class="product-sku1" name="div_name">This is the div content product-sku</div>
<div class="product-sku2" name="div_name">This is the div content product-sku</div>
<div class="product-sku" name="div_name">This is the div content product-sku</div>
</body>
</html>';
// do not preserve redundant white space (must be set before loading to have any effect)
$dom->preserveWhiteSpace = false;
// load the HTML
$dom->loadHTML($html_string);
// get all DIVs by tag name
$divs = $dom->getElementsByTagName('div');
// loop over all the DIVs
foreach ($divs as $div) {
if ($div->hasAttributes()) {
foreach ($div->attributes as $attribute){
if($attribute->name === 'class' && $attribute->value == 'product-sku'){
// print the DIV class name and content
echo 'DIV Class Name: '.$attribute->value.PHP_EOL;
echo 'DIV Content: '.$div->nodeValue.PHP_EOL;
}
}
}
}

I would use a regex (regular expression) to accomplish pulling skus out.
The regex:
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
See php regex docs.
New code:
<?php
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = curl_init($event[4]);
curl_setopt($htmldown, CURLOPT_RETURNTRANSFER, true);
$html=curl_exec($htmldown);
curl_close($htmldown);
echo "Downloaded";
preg_match('~<div class="product-sku"><b>Event Code:</b>(.*?)</div>~',$html,$matches);
foreach ($matches as $row) {
echo $row;
}
}
?>
And actually in this case (using that webpage) being that there is only one sku...
instead of:
foreach ($matches as $row) {
echo $row;
}
You could just use: echo $matches[1]; (The reason for array index 1 is because the whole regex pattern plus the sku will be in $matches[0] but just the subgroup containing the sku is in $matches[1].)

Try this:
require('simple_html_dom.php');
$html = new simple_html_dom();
if(!$events)
{
echo mysqli_error($con);
}
while($event = mysqli_fetch_row($events))
{
$htmldown = file_get_html($event[4]); // $event[4] is a URL, so fetch it; str_get_html() expects an HTML string
echo "Downloaded";
foreach ($htmldown->find('div[class=product-sku]') as $row) {
$sku = $row->plaintext;
echo $sku;
}
}
If the class "product-sku" is only used on divs, you can simply use:
$htmldown->find('.product-sku')
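
Another option, not taken from the answers above, is an XPath query through PHP's built-in DOMXPath. The sketch below matches "product-sku" as a whole class token, so it also works when the div carries additional classes. The sample markup and the SKU value ABC-123 are placeholders, not data from the real page.

```php
<?php
// Made-up sample; in practice this would be the downloaded page source.
$html_string = '<html><body>
<div class="product-sku1">other</div>
<div class="product-sku highlight"><b>Event Code:</b> ABC-123</div>
</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);       // ignore warnings from imperfect markup
$dom->loadHTML($html_string);

$xpath = new DOMXPath($dom);
// Pad the class attribute with spaces so that "product-sku1" does not match
// while class="product-sku highlight" still does.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " product-sku ")]';

$skus = [];
foreach ($xpath->query($query) as $div) {
    $skus[] = trim($div->textContent);
}
print_r($skus);
```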

PHP Simple HTML DOM Parser: Accessing custom attributes

I want to access a custom attribute that I added to some elements in an HTML file. Here's an example of the littleBox="someValue" attribute:
<div id="someId" littleBox="someValue">inner text</div>
The following doesn't work:
foreach($html->find('div') as $element){
echo $element;
if(isset($element->type)){
echo $element->littleBox;
}
}
I saw an article with a similar problem, but I couldn't replicate it for some reason. Here is what I tried:
function retrieveValue($str){
if (stripos($str, 'littleBox')){//check if element has it
$var=preg_split("/littleBox=\"/",$str);
//echo $var[1];
$var1=preg_split("/\"/",$var[1]);
echo $var1[0];
}
else
return false;
}
Whenever I call the retrieveValue() function, nothing happens. Is $element (in the first PHP example above) not a string? I don't know if I missed something, but it's not returning anything.
Here's the script in its entirety:
<?php
require("../../simplehtmldom/simple_html_dom.php");
if (isset($_POST['submit'])){
$html = file_get_html($_POST['webURL']);
// Find all images
foreach($html->find('div') as $element){
echo $element;
if(isset($element->type)!= false){
echo retrieveValue($element);
}
}
}
function retrieveValue($str){
if (stripos($str, 'littleBox')){//check if element has it
$var=preg_split("/littleBox=\"/",$str);
//echo $var[1];
$var1=preg_split("/\"/",$var[1]);
return $var1[0];
}
else
return false;
}
?>
<form method="post">
Website URL<input type="text" name="webURL">
<br />
<input type="submit" name="submit">
</form>

Have you tried:
$html->getElementById("someId")->getAttribute('littleBox');
You could also use SimpleXML:
$html = '<div id="someId" littleBox="someValue">inner text</div>';
$dom = new DOMDocument;
$dom->loadXML($html);
$div = simplexml_import_dom($dom);
echo $div->attributes()->littleBox;
I would advise against using regex to parse HTML, but shouldn't this part look like this:
$str = $html->getElementById("someId")->outertext;
$var = preg_split('/littleBox=\"/', $str);
$var1 = preg_split('/\"/',$var[1]);
echo $var1[0];
Also see this answer https://stackoverflow.com/a/8851091/1059001

See phpQuery (http://code.google.com/p/phpquery/): it's like jQuery, but for PHP. A very powerful library.

PHP: Display the first 500 characters of HTML

I have a large chunk of HTML in a PHP variable, like this:
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
I want to display only the first 500 characters of this code. The character count should include only the text inside the HTML tags and exclude the tags and attributes themselves.
Trimming the code should not break the DOM structure of the HTML.
Is there any tutorial or working example available?

If it's only the text you want, you can also do this:
substr(strip_tags($html_code),0,500);

Ooohh... I know this. I can't get it exactly off the top of my head, but you want to load the text you've got as a DOMDocument
http://www.php.net/manual/en/class.domdocument.php
then grab the text from the entire document node (as a DOMNode http://www.php.net/manual/en/class.domnode.php)
This won't be exactly right, but hopefully it will steer you onto the right track.
Try something like:
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$dom = new DOMDocument();
$dom->loadHTML($html_code);
$text_to_strip = $dom->textContent;
$stripped = mb_substr($text_to_strip,0,500);
echo "$stripped"; // The Sameple text.Another sample text.....
edit ok... that should work. just tested locally
edit2
Now that I understand you want to keep the tags but limit the text, let's see. You're going to want to loop over the content until you get to 500 characters. This is probably going to take a few edits and passes for me to get right, but hopefully I can help. (Sorry I can't give it my undivided attention.)
The first case is when the text is less than 500 characters: nothing to worry about. Starting with the above code, we can do the following.
if (mb_strlen($text_to_strip) > 500) {
// this is where we do our work.
$characters_so_far = 0;
// loadHTML() wraps the fragment in <html><body>, so start from the body
$body = $dom->getElementsByTagName('body')->item(0);
// copy the live node list so nodes can be removed while looping
foreach (iterator_to_array($body->childNodes) as $ChildNode) {
// should check if $ChildNode->hasChildNodes();
// probably put some of this stuff into a function
$characters_in_next_node = mb_strlen($ChildNode->textContent);
if ($characters_so_far + $characters_in_next_node > 500) {
// remove the node that would take us past the limit
$ChildNode->parentNode->removeChild($ChildNode);
}
$characters_so_far += $characters_in_next_node;
}
$final_out = $dom->saveHTML();
} else {
$final_out = $html_code;
}
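
The loop above drops whole top-level nodes. A variant that descends into the tree and trims the text node that crosses the limit could look like the sketch below. It is one possible approach under stated assumptions, not a tested library routine: truncateNode and truncateHtml are hypothetical names, the sample string and the limit of 24 are made up, and the input is assumed to be plain ASCII (loadHTML() needs a charset hint for other encodings).

```php
<?php
// Hypothetical helper: consumes the character budget in document order,
// shortens the text node that crosses the limit, and removes what follows.
function truncateNode(DOMNode $node, int &$remaining): void
{
    // Copy the live child list so removals do not disturb the loop.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);             // budget spent: drop the rest
        } elseif ($child instanceof DOMText) {
            if ($child->length > $remaining) {
                // Cut off the tail of this text node.
                $child->deleteData($remaining, $child->length - $remaining);
            }
            $remaining -= $child->length;
        } elseif ($child->hasChildNodes()) {
            truncateNode($child, $remaining);       // descend into elements
        }
    }
}

function truncateHtml(string $html, int $limit): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);               // tolerate loose markup
    // Wrap the fragment so all of it sits under one known element.
    $doc->loadHTML('<div>' . $html . '</div>');
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    $remaining = $limit;
    truncateNode($wrapper, $remaining);

    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);             // serialise without the wrapper
    }
    return $out;
}

$html_code = '<div class="box">The sample text.</div><br><span>Another sample text.</span>';
echo truncateHtml($html_code, 24);
```

Because the tree is edited rather than the string, every tag that was opened before the cut is still closed in the output.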

I'm pasting below a PHP class I wrote a long time ago, but I know it works. It's not exactly what you're after, as it deals with words instead of a character count, but I figure it's pretty close and someone might find it useful.
class HtmlWordManipulator
{
var $stack = array();
function truncate($text, $num=50)
{
if (preg_match_all('/\s+/', $text, $junk) <= $num) return $text;
$text = preg_replace_callback('/(<\/?[^>]+\s+[^>]*>)/', array($this, '_truncateProtect'), $text); // the callback is a method, so it must be passed with $this
$words = 0;
$out = array();
$text = str_replace('<',' <',str_replace('>','> ',$text));
$toks = preg_split('/\s+/', $text);
foreach ($toks as $tok)
{
if (preg_match_all('/<(\/?[^\x01>]+)([^>]*)>/',$tok,$matches,PREG_SET_ORDER))
foreach ($matches as $tag) $this->_recordTag($tag[1], $tag[2]);
$out[] = trim($tok);
if (! preg_match('/^(<[^>]+>)+$/', $tok))
{
if (!strpos($tok,'=') && !strpos($tok,'<') && strlen(trim(strip_tags($tok))) > 0)
{
++$words;
}
else
{
/*
echo '<hr />';
echo htmlentities('failed: '.$tok).'<br /)>';
echo htmlentities('has equals: '.strpos($tok,'=')).'<br />';
echo htmlentities('has greater than: '.strpos($tok,'<')).'<br />';
echo htmlentities('strip tags: '.strip_tags($tok)).'<br />';
echo str_word_count($text);
*/
}
}
if ($words > $num) break;
}
$truncate = $this->_truncateRestore(implode(' ', $out));
return $truncate;
}
function restoreTags($text)
{
foreach ($this->stack as $tag) $text .= "</$tag>";
return $text;
}
private function _truncateProtect($match)
{
return preg_replace('/\s/', "\x01", $match[0]);
}
private function _truncateRestore($strings)
{
return preg_replace('/\x01/', ' ', $strings);
}
private function _recordTag($tag, $args)
{
// XHTML
if (strlen($args) and $args[strlen($args) - 1] == '/') return;
else if ($tag[0] == '/')
{
$tag = substr($tag, 1);
for ($i=count($this->stack) -1; $i >= 0; $i--) {
if ($this->stack[$i] == $tag) {
array_splice($this->stack, $i, 1);
return;
}
}
return;
}
else if (in_array($tag, array('p', 'li', 'ul', 'ol', 'div', 'span', 'a')))
$this->stack[] = $tag;
else return;
}
}
truncate is what you want, and you pass it the html and the number of words you want it trimmed down to. it ignores html while counting words, but then rewraps everything in html, even closing trailing tags due to the truncation.
please don't judge me on the complete lack of oop principles. i was young and stupid.
edit:
so it turns out the usage is more like this:
$content = $manipulator->restoreTags($manipulator->truncate($myHtml,$numOfWords));
stupid design decision. allowed me to inject html inside the unclosed tags though.

I'm not up to coding a real solution, but if someone wants to, here's what I'd do (a rough PHP sketch):
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$aggregate = '';
$document = new DOMDocument();
$document->loadHTML($html_code);
$xpath = new DOMXPath($document);
foreach ($xpath->query('//text()') as $textNode) {
$aggregate .= $textNode->nodeValue; // This is the text, not HTML. Each text node
// holds only the text directly in its tag,
// not the text of the tag's children.
}
