How to get "innerContent" with DOMdocument? [duplicate] - php

<blockquote>
<p>
2 1/2 cups sweet cherries, pitted<br>
1 tablespoon cornstarch <br>
1/4 cup fine-grain natural cane sugar
</p>
</blockquote>
Hi, I want to get the text inside the 'p' tag. As you can see, there are three different lines, and I want to print them separately after adding some extra text to each line. Here is my code block:
$tags = $dom->getElementsByTagName('blockquote');
foreach($tags as $tag)
{
$datas = $tag->getElementsByTagName('p');
foreach($datas as $data)
{
$line = $data->nodeValue;
echo $line;
}
}
The main problem is that $line contains the full text inside the 'p' tag, including the 'br' tags. How can I separate the three lines so I can treat them individually?
Thanks in advance.

You can do that with XPath. All you have to do is query the text nodes. No need to explode or something like that:
$dom = new DOMDocument;
$dom->loadHtml($html);
$xp = new DOMXPath($dom);
foreach ($xp->query('/html/body/blockquote/p/text()') as $textNode) {
echo "\n<li>", trim($textNode->textContent);
}
The non-XPath alternative would be to iterate the children of the P tag and only output them when they are DOMText nodes:
$dom = new DOMDocument;
$dom->loadHtml($html);
foreach ($dom->getElementsByTagName('p')->item(0)->childNodes as $pChild) {
if ($pChild->nodeType === XML_TEXT_NODE) {
echo "\n<li>", trim($pChild->textContent);
}
}
Both will output (demo)
<li>2 1/2 cups sweet cherries, pitted
<li>1 tablespoon cornstarch
<li>1/4 cup fine-grain natural cane sugar
Also see DOMDocument in php for an explanation of the node concept. It's crucial to understand when working with DOM.
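To tie this back to the question (print each line separately with some extra text), here is a minimal sketch of the XPath approach wrapped in a function. The function name blockquoteLines and the "Ingredient N:" prefix are made up for illustration, and the query is loosened to //blockquote/p/text() so it does not depend on the exact document layout.

```php
<?php
// Hypothetical helper: returns the trimmed, non-empty text lines
// of every <p> that is a direct child of a <blockquote>.
function blockquoteLines(string $html): array
{
    $dom = new DOMDocument;
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $lines = [];
    // Each run of text between <br> elements is its own text node.
    foreach ($xp->query('//blockquote/p/text()') as $textNode) {
        $text = trim($textNode->textContent);
        if ($text !== '') {
            $lines[] = $text;
        }
    }
    return $lines;
}

$html = "<blockquote>\n<p>\n2 1/2 cups sweet cherries, pitted<br>\n"
      . "1 tablespoon cornstarch <br>\n"
      . "1/4 cup fine-grain natural cane sugar\n</p>\n</blockquote>";

// Add some extra text to each line, as the question asks.
foreach (blockquoteLines($html) as $i => $line) {
    echo 'Ingredient ', $i + 1, ': ', $line, "\n";
}
```

Returning an array instead of echoing inside the loop keeps the extraction separate from whatever you want to prepend or append to each line.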

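You can use explode, but nodeValue holds text only and never contains '<br>', so split the serialized node instead:
$lines = explode('<br>', strip_tags($dom->saveHTML($data), '<br>'));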

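Here is a solution that started out in JavaScript syntax (split). In PHP the equivalent is explode, assuming $line still contains the <br> markup:
$parts = explode("<br>", $line);
echo $parts[0];
echo $parts[1];
echo $parts[2];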

You can use the PHP explode function like this (assuming the lines in your <p> tag are separated by <br>):
$tags = $dom->getElementsByTagName('blockquote');
foreach($tags as $tag)
{
$datas = $tag->getElementsByTagName('p');
foreach($datas as $data)
{
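// nodeValue returns text only, so it never contains '<br>'; serialize the node and keep just the <br> tags.
$contents = strip_tags($dom->saveHTML($data), '<br>');
$lines = explode('<br>',$contents);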
foreach($lines as $line) {
echo $line;
}
}
}

Related

Is there a way to match words to sentences inside a html <b> tag in PHP

I have this code to extract the text between b tags.
$source_url = "https://www.wordpress.com/";
$html = file_get_contents($source_url);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('b');
$words = "php";
echo "<pre>";
print_r($dom);
echo "</pre>";
I tried to put the text in an array using array_push and similar functions, but in_array only returns true if I pass the whole sentence, not just a single word.
What I want is this: if a sentence contains 'php', return true.
Try this:
foreach($links as $link) {
$p = strtolower($link->nodeValue);
if (strpos($p, 'php') !== false) {
// do something
}
}
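As a self-contained sketch of the same check: the function name anyBoldContains and the sample markup are made up for illustration. It uses stripos, which is a case-insensitive substring test, so 'php' would also match inside a longer word such as 'phpunit'.

```php
<?php
// Hypothetical helper: true if the text of any <b> element
// contains $word (case-insensitive substring match).
function anyBoldContains(string $html, string $word): bool
{
    $dom = new DOMDocument;
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('b') as $b) {
        if (stripos($b->textContent, $word) !== false) {
            return true;
        }
    }
    return false;
}

// Sample markup, invented for this example.
$sample = '<p><b>Learn PHP in a weekend</b> and <b>something else</b></p>';

var_dump(anyBoldContains($sample, 'php'));
var_dump(anyBoldContains($sample, 'python'));
```

If you need whole-word matches only, swapping stripos for preg_match with a \b-delimited pattern built via preg_quote would be one option.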

Removing every li tag before reaching the first p tag in string

Suppose I have a string containing some HTML. I want to remove every li tag before reaching the first p tag.
How do I achieve something like that?
Example string:
$str = "<img src='something.png'/>some_text_here<li>needs_to_be_removed</li>
<li>also_needs_to_be_removed</li>some_other_text<p>finally</p>more_text_here
<li>this_should_not_be_removed</li>";
The first two li tags need to be removed.
Here is a simple approach. Note that it removes everything from the first <li> up to the first <p>, including any text in between:
$mystring = "mystringwith<li>toberemovedstring</li><li>againremove</li><p>do not remove me</p>";//the string you provide
$findme = '<li>';//the string you want to search in $mystring
$findpee = '<p>';//haha pee also where to end it
$pos = strpos($mystring, $findme);//first position of <li>
$pospee = strpos($mystring, $findpee);// then position of pee.. get it :)
//Then we remove it
$result=substr_replace ( $mystring ,"" , $pos, ($pospee-$pos));
echo $result;
Edit: PHP sandbox
http://sandbox.onlinephpfunctions.com/code/e534259e2312682a04b64c6e3aae1521422aacd2
you can check the result here as well
You can do it with PHP's DOMdocument using the below traversal function
$doc = new DOMDocument();
$doc->loadHTML($str);
$foundp = false;
showDOMNode($doc);
//now $doc contains the string you want
$newstr = $doc->saveHTML();
function showDOMNode(DOMNode &$domNode) {
global $foundp;
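// Iterate over a copy: removing children from the live childNodes list while looping skips nodes.
foreach (iterator_to_array($domNode->childNodes) as $node)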
{
if ($node->nodeName == "li" && $foundp==false){
//delete this node
$domNode->removeChild($node);
}
else if ($node->nodeName == "p"){
//stop here
$foundp = true;
return;
}
else if($node->hasChildNodes() && $foundp==false) {
//recursively
showDOMNode($node);
}
}
}
With XPath:
$str = "<img src='something.png'/>some_text_here<li>needs_to_be_removed</li>
<li>also_needs_to_be_removed</li>some_other_text<p>finally</p>more_text_here
<li>this_should_not_be_removed</li>";
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML('<div>' . $str .'</div>', LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
// ^---------------^----- add a root element
$xp = new DOMXPath($dom);
$lis = $xp->query('//p[1]/preceding-sibling::li');
foreach ($lis as $li) {
$li->parentNode->removeChild($li);
}
$result = '';
// add each child node of the root element to the result
foreach ($dom->getElementsByTagName('div')->item(0)->childNodes as $child) {
$result .= $dom->saveHTML($child);
}
I would suggest using a PHP parser library, which would be a much better and faster approach. I personally use this one, https://github.com/paquettg/php-html-parser, in my projects. It provides APIs like
$child->nextSibling()
$content->innerHtml,
$content->firstChild()
and more which can come in handy.
You can loop over all elements, keep track of the "li" tags as you go, and once you reach a "p" tag, delete the preceding li siblings via $child->previousSibling().

Modify html attribute with php

I have a html string that contains exactly one a-element in it. Example:
test
In php I have to test if rel contains external and if yes, then modify href and save the string.
I have looked at DOM nodes and objects, but they seem like too much for a single a element: I have to iterate to get the HTML nodes, and I am not sure how to test whether rel exists and contains external.
$html = new DOMDocument();
$html->loadHtml($txt);
$a = $html->getElementsByTagName('a');
$attr = $a->item(0)->attributes;
...
At this point I get a DOMNamedNodeMap, which seems like overhead. Is there any simpler way to do this, or should I do it with DOM?
Is there any simpler way to do this, or should I do it with DOM?
Do it with DOM.
Here's an example:
<?php
$html = 'test';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]");
foreach($nodes as $node) {
$node->setAttribute('href', 'http://example.org');
}
echo $dom->saveHTML();
I kept going to modify with DOM. This is what I get:
$html = new DOMDocument();
$html->loadHtml('<?xml encoding="utf-8" ?>' . $txt);
$nodes = $html->getElementsByTagName('a');
foreach ($nodes as $node) {
foreach ($node->attributes as $att) {
if ($att->name == 'rel') {
if (strpos($att->value, 'external') !== false) {
$node->setAttribute('href','modified_url_goes_here');
}
}
}
}
$txt = $html->saveHTML();
I did not want to load any other library for just this one string.
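For a single HTML fragment, the two DOM answers above can be combined into one small function. This is only a sketch: the function name rewriteExternalLinks, the sample anchor, and both URLs are invented for illustration. The XPath expression matches external as a whole token in rel, so a value like "externality" is left alone, and the fragment is wrapped in a div so that saveHTML does not add html/body wrappers.

```php
<?php
// Hypothetical helper: sets href on every <a> whose rel attribute
// contains the token "external", and returns the modified fragment.
function rewriteExternalLinks(string $html, string $newHref): string
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    // Wrap the fragment in a single root element and skip the
    // implied <html>/<body> and the doctype.
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $query = "//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]";
    foreach ($xp->query($query) as $a) {
        $a->setAttribute('href', $newHref);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Sample input, invented for this example.
$txt = '<a href="http://old.example/" rel="nofollow external">test</a>';
echo rewriteExternalLinks($txt, 'http://example.org');
```

Note that loadHTML assumes ISO-8859-1 unless told otherwise, so for UTF-8 input you would still need something like the encoding prefix shown in the answer above.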
The best way is to use a HTML parser/DOM, but here's a regex solution:
$html = 'test<br>
<p> Some text</p>
test2<br>
<a rel="external">test3</a> <-- This won\'t work since there is no href in it.
';
$new = preg_replace_callback('/<a.+?rel\s*=\s*"([^"]*)"[^>]*>/i', function($m){
if(strpos($m[1], 'external') !== false){
$m[0] = preg_replace('/href\s*=\s*(("[^"]*")|(\'[^\']*\'))/i', 'href="http://example.com"', $m[0]);
}
return $m[0];
}, $html);
echo $new;
Online demo.
You could use a regular expression like
if it matches /\s+rel\s*=\s*".*external.*"/
then do a regExp replace like
/(<a.*href\s*=\s*")([^"]*)("[^>]*>)/\1[your new href here]\3/
Though using a library that can do this kind of stuff for you is much easier (like jquery for javascript)

Stripping <ins> and <del> tags from <script> tags

I have some code that is generating a diff between two documents, inserting <ins> and <del> tags haphazardly. For the most part it's doing a great job, but every now and then it inserts tags in script, style and the title tags.
Any ideas on how to remove the <del> tags (including the text between them), remove the <ins> tags (but retaining the text within them as part of the original string), however only within those three tags? (title, script and style).
Don't use regex to do this; it sounds like you have to deal with many, many lines. DOMDocument is great.
$dom = new DOMDocument;
$dom->loadHTML($your_html_string);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//script|//title|//style') as $node) {
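// Copy the live node lists first; removing nodes while iterating a live list skips entries.
foreach (iterator_to_array($node->getElementsByTagName('del')) as $delNode) {
// The <del> may not be a direct child of $node, so remove it via its own parent.
$delNode->parentNode->removeChild($delNode);
}
foreach (iterator_to_array($node->getElementsByTagName('ins')) as $insNode) {
$insNode->parentNode->replaceChild($dom->createTextNode($insNode->nodeValue), $insNode);
}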
}
Untested, this may or may not work:
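// '#' delimiters avoid having to escape the slashes in the closing tags; the s flag lets . match newlines.
$str = preg_replace('#(<script.*?>.*?)<del>.*?</del>(.*?</script>)#is', '$1$2', $str);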
It attempts to look within the <script> ... </script> block of the string, and replace any instances of <del>...</del> with empty string.
The following ended up working quite well for me:
$tags = array('script', 'title', 'style');
foreach ($tags as $tag) {
$str = preg_replace_callback(
'/(<' . ($tag) . '\b[^>]*>)(.*?)(<\/' . ($tag) . '>)/is',
function($match) {
$replaced = preg_replace(
array(
'/__Delete-Start__.+__Delete-End__/Uis',
'/__Insert-Start__(.+)__Insert-End__/Uis'
),
array(
'',
'$1'
),
$match[2]
);
return ($match[1]) . ($replaced) . ($match[3]);
},
$str
);
}
While the following didn't end up being my solution, it did get me far and could be useful to others:
$dom = new DOMDocument;
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//script|//title|//style') as $node) {
foreach ($node->getElementsByTagName('del') as $delNode) {
$node->removeChild($delNode);
}
foreach ($node->getElementsByTagName('ins') as $insNode) {
$node->replaceChild($dom->createTextNode($insNode->nodeValue), $insNode);
}
}
$str = (string) $dom->saveXML($dom, LIBXML_NOEMPTYTAG);//$xpath->query('//p')->item(0));
Hope this helps someone else.
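The callback solution above works on placeholder markers (__Delete-Start__ and so on). Assuming the diff output instead contains literal <ins> and <del> tags, the same idea can be sketched as follows. The function name stripDiffTags and the sample input are made up for illustration, and, being regex-based, it would also touch a script that legitimately contains the string "<del>".

```php
<?php
// Hypothetical helper: inside <script>, <title> and <style> only,
// drop <del>...</del> entirely and unwrap <ins>...</ins>.
function stripDiffTags(string $html): string
{
    foreach (['script', 'title', 'style'] as $tag) {
        $html = preg_replace_callback(
            // 1: opening tag, 2: contents, 3: closing tag
            '#(<' . $tag . '\b[^>]*>)(.*?)(</' . $tag . '>)#is',
            function (array $m): string {
                // Remove deleted text together with its tags.
                $inner = preg_replace('#<del\b[^>]*>.*?</del>#is', '', $m[2]);
                // Keep inserted text, remove only the <ins> tags.
                $inner = preg_replace('#</?ins\b[^>]*>#i', '', $inner);
                return $m[1] . $inner . $m[3];
            },
            $html
        );
    }
    return $html;
}

// Sample input, invented for this example.
$in = '<title>Old<del> draft</del><ins> final</ins></title><p><ins>kept</ins></p>';
echo stripDiffTags($in);
```

Markup outside the three tags is left untouched, so the <ins> inside the paragraph survives.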

php preg_replace need help

I have created a function to search through strings and replace keywords in those strings with links. I am using
preg_replace('/\b(?<!=")(?<!=\')(?<!=)(?<!=")(?<!>)(?<!>)' . $keyword . '(?!</a)(?!</a)\b', $newString, $row);
which is working as expected. The only issue is that if someone had a link like this
Luxury Automobile sales
Automobile being our $keyword in this example.
It would end up looking like
Luxury <a href="www.domain.tdl/keywords.html">Automobile Sales</a>
You can understand my frustration.
Not being confident in regex I thought I would ask if anyone here would know a solution.
Thanks!
How about a proper HTML parser like DOMDocument?
$html = 'Luxury Automobile sales';
$dom = new DomDocument;
$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('a');
foreach ($nodes as $node)
{
$node->nodeValue = str_replace('Automobile', 'Cars', $node->nodeValue);
echo simplexml_import_dom($node)->asXML();
}
Getting an element's attribute is not a problem either:
foreach ($nodes as $node)
{
$attr = $node->getAttributeNode('href');
$attr->value = str_replace('Automobile', 'keyword', $attr->value);
echo simplexml_import_dom($node)->asXML();
}
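For the original goal (turn a keyword into a link, but leave existing links alone), one option is to touch only the text nodes that are not inside an a element. This is a sketch under assumptions, not a drop-in replacement for the asker's function: the name linkKeyword, the sample HTML, and the URLs are invented for illustration.

```php
<?php
// Hypothetical helper: wraps each whole-word occurrence of $keyword
// in a link, skipping text that is already inside an <a> element.
function linkKeyword(string $html, string $keyword, string $url): string
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    // Wrap the fragment in a single root so no <html>/<body> is added.
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $pattern = '/\b(' . preg_quote($keyword, '/') . ')\b/i';

    // Only text nodes that have no <a> ancestor.
    foreach ($xp->query('//text()[not(ancestor::a)]') as $text) {
        // With DELIM_CAPTURE, odd indexes hold the matched keyword.
        $parts = preg_split($pattern, $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $dom->createElement('a');
                $a->setAttribute('href', $url);
                $a->appendChild($dom->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Sample input, invented for this example.
$row = 'Luxury <a href="/cars">Automobile sales</a> and Automobile parts';
echo linkKeyword($row, 'Automobile', 'keywords.html');
```

Because the existing link's text is never visited, "Automobile sales" keeps its original href and only the second, unlinked occurrence gets wrapped.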
