DOMDocument : how to get inner HTML as Strings separated by line-breaks? - php

<blockquote>
<p>
2 1/2 cups sweet cherries, pitted<br>
1 tablespoon cornstarch <br>
1/4 cup fine-grain natural cane sugar
</p>
</blockquote>
Hi, I want to get the text inside the 'p' tag. As you can see, there are three different lines, and I want to print them separately after adding some extra text to each line. Here is my code block:
$tags = $dom->getElementsByTagName('blockquote');
foreach($tags as $tag)
{
$datas = $tag->getElementsByTagName('p');
foreach($datas as $data)
{
$line = $data->nodeValue;
echo $line;
}
}
The main problem is that $line contains the full text inside the 'p' tag as a single string. How can I separate the three lines so that I can treat them individually?
Thanks in advance.

You can do that with XPath. All you have to do is query the text nodes. No need to explode or something like that:
$dom = new DOMDocument;
$dom->loadHtml($html);
$xp = new DOMXPath($dom);
foreach ($xp->query('/html/body/blockquote/p/text()') as $textNode) {
echo "\n<li>", trim($textNode->textContent);
}
The non-XPath alternative would be to iterate the children of the P tag and only output them when they are DOMText nodes:
$dom = new DOMDocument;
$dom->loadHtml($html);
foreach ($dom->getElementsByTagName('p')->item(0)->childNodes as $pChild) {
if ($pChild->nodeType === XML_TEXT_NODE) {
echo "\n<li>", trim($pChild->textContent);
}
}
Both will output:
<li>2 1/2 cups sweet cherries, pitted
<li>1 tablespoon cornstarch
<li>1/4 cup fine-grain natural cane sugar
Also see DOMDocument in php for an explanation of the node concept. It's crucial to understand when working with DOM.
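
Since the question also asks for extra text on each line, here is a minimal sketch that collects the text nodes into an array first. The helper name paragraphLines() and the "Ingredient N:" prefix are made up for illustration; the XPath query is a relative variant of the one above.

```php
<?php
// Hypothetical helper: returns the trimmed, non-empty text nodes of every
// <p> that is a direct child of a <blockquote>.
function paragraphLines(string $html): array
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $xp = new DOMXPath($dom);

    $lines = [];
    foreach ($xp->query('//blockquote/p/text()') as $textNode) {
        $text = trim($textNode->textContent);
        if ($text !== '') {
            $lines[] = $text;
        }
    }
    return $lines;
}

$html = "<blockquote>\n<p>\n2 1/2 cups sweet cherries, pitted<br>\n"
      . "1 tablespoon cornstarch <br>\n"
      . "1/4 cup fine-grain natural cane sugar\n</p>\n</blockquote>";

$lines = paragraphLines($html);
foreach ($lines as $i => $line) {
    // "Ingredient N: " is placeholder extra text, not part of the question.
    echo 'Ingredient ', $i + 1, ': ', $line, "\n";
}
```

Keeping the lines in an array makes it easy to treat each one separately before printing.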

You can use explode() on the serialized markup (nodeValue returns only the text, without the <br> tags):
$lines = explode('<br>', $dom->saveHTML($data));

Here is a solution using explode(), assuming $line holds the markup including the <br> tags:
$parts = explode("<br>", $line);
echo $parts[0];
echo $parts[1];
echo $parts[2];

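You can use the PHP explode function like this (assuming each line in your <p> tag ends with <br>). Note that nodeValue returns only the text with the tags removed, so split the markup returned by saveHTML() instead:
$tags = $dom->getElementsByTagName('blockquote');
foreach ($tags as $tag)
{
$datas = $tag->getElementsByTagName('p');
foreach ($datas as $data)
{
// saveHTML($node) keeps the <br> tags that nodeValue drops
$contents = $dom->saveHTML($data);
$lines = explode('<br>', $contents);
foreach ($lines as $line) {
// remove the surrounding <p> tags and whitespace
echo trim(strip_tags($line));
}
}
}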

Related

Split HTML document into words and spans using PHP

Using PHP I want to split an HTML document into its individual words, but keeping certain <span>s together. This is as close as I've got so far, with a minimal example of HTML (that would be larger and more complex in reality):
$html = '<html><body>
<h1>My header</h1>
<p>A test <b>paragraph</b> with <span itemscope itemtype="http://schema.org/Person">Bob Ferris</span> a person.</p>
</body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
foreach($xpath->query('.//span[@itemtype]|.//text()[normalize-space()]') as $node) {
echo $node->nodeType . " " . $node->nodeValue . "<br>";
}
This outputs:
3 My header
3 A test
3 paragraph
3 with
1 Bob Ferris
3 Bob Ferris
3 a person.
(nodeType 3 is a text node, 1 is an element)
I also need to:
Split text nodes into individual words and strip punctuation (easily done at this stage, but could it be done in the xpath query?)
Only capture the "Bob Ferris" element, and not the "Bob Ferris" text node.
I will need to access the attributes of these <span>s too, with $node->getAttribute()
This seems to do it:
// 1: Match all <span>s with an itemtype attribute.
// 2: OR
// 3: Match text strings that are not in one of those spans (and get rid of some spaces).
foreach($xpath->query('.//span[@itemtype]|.//text()[not(parent::span[@itemtype])][normalize-space()]') as $node) {
if ($node->nodeType == 1) {
// A span.
echo $node->nodeValue . "<br>";
} else {
// A text node - split into words and trim trailing periods.
$words = explode(" ", trim($node->nodeValue));
foreach($words as $word) {
echo rtrim($word, ".") . "<br>";
}
}
}
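
The loop above only trims trailing periods. A minimal sketch of a more general split, assuming that words consist of letters, digits, apostrophes and hyphens, could use preg_split(); the helper name splitWords() is made up for illustration.

```php
<?php
// Hypothetical helper: split a text node's value into words, dropping
// whitespace and punctuation.
function splitWords(string $text): array
{
    // \p{L} matches letters and \p{N} digits; any run of other characters
    // (apart from ' and -) is treated as a separator.
    return preg_split("/[^\\p{L}\\p{N}'-]+/u", $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitWords(' a person.'));
```

In the else branch above, splitWords($node->nodeValue) could stand in for the explode() and rtrim() pair.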
Just for fun, a one liner with XPath 2.0 :
tokenize(replace(replace(concat(string-join((//text()[not(parent::span)][normalize-space()])[position()<last()]|//span[@itemtype],","),replace((//text()[not(parent::span)][normalize-space()])[last()],"\W$","")),"\W+",","),replace(//span[@itemtype]/text(),"\W+",","),//span[@itemtype]/text()),",+")
Output :

Match multiple results single line php regex

I would like to match multiple results in a single-line string, but I only get the last iteration of the result I expected.
For example I have this string : <ul><li><a href="#">test1</a></li><li><a href="#">test2</a></li><li><a href="#">test3</a></li></ul>
I would like to get :
test1
test2
test3
but I only get "test3".
I used this regex, <ul>(<li><a.*>(.*)<\/a><\/li>)*<\/ul>, on https://regex101.com/ but I don't know what I did wrong.
Use a parser instead:
<?php
$html = <<<DATA
<ul>
<li><a href="#">test1</a></li>
<li><a href="#">test2</a></li>
<li><a href="#">test3</a></li>
</ul>
DATA;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DomXPath($dom);
$links = $xpath->query("//li/a");
foreach ($links as $link) {
echo $link->textContent;
}
?>
This sets up the DOM and uses an xpath expression to get the element(s).
Try like this:
(?<=(<a href="#">))([\s\S]| |w\[0-9]| )+?(?=(<\/a>))
or
(?<=(">))([\s\S]| |w\[0-9]| )+?(?=(<\/a>))
or
(?<=(<a href="#">))(.)+?(?=(<\/a>))
link with example:
https://regex101.com/r/MHnxxh/1
or
https://regex101.com/r/MHnxxh/2
<?php
$str = '
<ul>
<li><a href="#">test1</a></li>
<li><a href="#">test2</a></li>
<li><a href="#">test3</a></li>
</ul>
';
preg_match_all('/(?<=(#">))([\s\S]| |w\[0-9]| )+?(?=(<\/a>))/', $str, $matches);
// display array if need
echo "<pre>";
print_r($matches);
// display list
foreach ($matches[0] as $key => $value) {
echo $value ."\r\n";
}
?>
Try this regex pattern:
preg_match_all('/#">([a-z]\w+)<\/a>/', $str, $out, PREG_PATTERN_ORDER);
This extracts only the text strings, which end up in $out[1].
You can use preg_replace to strip the tags:
$test = '<ul><li>test1</li><li>test2</li><li>test3</li></ul>';
echo preg_replace('/<[^>]*>/', ' ', $test);

need regular expression for li

How can I get the strings between my li tags in PHP? I have tried a lot of PHP code, but none of it works.
<li class="release">
<strong>Release info:</strong>
<div>
How.to.Train.Your.Dragon.2.2014.All.BluRay.Persian
</div>
<div>
How.to.Train.Your.Dragon.2.2014.1080p.BRRip.x264.DTS-JYK
</div>
<div>
How.to.Train.Your.Dragon.2.2014.720p.BluRay.x264-SPARKS
</div>
</li>
You can try this:
$myPattern = "/<li class=\"release\">(.*?)<\/li>/s";
$myText = '<li class="release">*</li>';
preg_match($myPattern,$myText,$match);
echo $match[1];
You don't need a regular expression. Using regular expressions to parse HTML code seems to be a common mistake (I took the URL from T.J. Crowder's comment).
Use a tool that parses HTML instead, for instance the DOM library.
This is a solution to get all the strings (I'm assuming those are the values of the text nodes):
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//li//text()');
$strings = array();
foreach($nodes as $node) {
$string = trim($node->nodeValue);
if( $string !== '' ) {
$strings[] = $string;
}
}
print_r($strings); outputs:
Array
(
[0] => Release info:
[1] => How.to.Train.Your.Dragon.2.2014.All.BluRay.Persian
[2] => How.to.Train.Your.Dragon.2.2014.1080p.BRRip.x264.DTS-JYK
[3] => How.to.Train.Your.Dragon.2.2014.720p.BluRay.x264-SPARKS
)

Extract value from href tag in table using php

I have a table with a td like the one below. I want to extract the value "abl", which is the value of the symbol parameter in the href attribute.
<td>
Ace Bank Limited
</td>
I can simply extract Ace Bank Limited using $td->nodeValue; but how can I extract abl using php only?
try with DOM
$html = '<td>Ace Bank Limited</td>';
$dom = new DOMDocument;
@$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
$anchor = $tag->getAttribute('href');
$text = explode('=', $anchor);
echo $text[1]; //ABL
}
or using preg_match
preg_match('/=([^\"]+)/', $html, $matches);
echo $matches[1]; //ABL
Try with Regex:- preg_match('/symbol=([^"]+)/', $table_data, $matched);
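
If the href is a regular URL, PHP's own URL functions can read the parameter without explode() or a regex. This is only a sketch: the URL below is a made-up placeholder, and only the parameter name symbol and the link text come from the question.

```php
<?php
// Placeholder markup: the real href is not shown in the question.
$html = '<table><tr><td>'
      . '<a href="http://example.com/company?symbol=abl">Ace Bank Limited</a>'
      . '</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);

$symbols = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    // parse_url() returns the query string, or null when there is none.
    $query = parse_url($a->getAttribute('href'), PHP_URL_QUERY);
    if (!is_string($query)) {
        continue;
    }
    // parse_str() decodes the query string into an associative array.
    parse_str($query, $params);
    if (isset($params['symbol'])) {
        $symbols[] = $params['symbol'];
    }
}
print_r($symbols);
```

Unlike splitting on "=", this keeps working when the link carries more than one query parameter.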

PHP preg_match_all html tags content search

I have a problem: I want to count the symbols (characters) inside HTML tags in a text.
Text example 1:
Hello <b>world</b>, <i>stackoverflow</i>
Text example 2:
Hello <b>world, <i>stackoverflow</i></b>
So, I need to count how many symbols there are in the b block and in the i block separately.
I did this:
preg_match_all('#<(b|i)>(.*)<\/(b)>#Uusi', $temp, $tags_check);
foreach($tags_check[2] as $val)
{
if(mb_strlen($val) > 50)
{
$errors = 'error';
break;
}
}
But it only works for the first example; for the second example I need to change the regexp. I need to match from the opening b to the closing b, not from the opening b to the closing i. How can I do this?
DOM + XPath way to accomplish that:
$html = 'Hello <b>world</b>, <i>stackoverflow</i>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$error_nodes = $xpath->query('//b[string-length(text()) > 50]|//i[string-length(text()) > 50]');
foreach ($error_nodes as $node) {
print $node->nodeValue;
}
Good luck!
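
One caveat for the nested second example: in XPath 1.0, string-length(text()) measures only the first text node child, so for <b>world, <i>stackoverflow</i></b> it counts just "world, ". A minimal sketch that counts the full content of each element separately, assuming the mbstring extension is available (the question already uses mb_strlen), could look like this:

```php
<?php
$html = 'Hello <b>world, <i>stackoverflow</i></b>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$lengths = [];
foreach ($xpath->query('//b|//i') as $node) {
    // textContent includes the text of nested elements, so the <b>
    // count also covers the <i> inside it.
    $lengths[] = [$node->nodeName, mb_strlen($node->textContent)];
}
print_r($lengths);
```

Each entry pairs a tag name with its character count, so the 50-character check from the question can be applied per element.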
