How would I modify a HTML string without touching the HTML elements? - php

Suppose I have this string:
$test = '<p>You are such a <strong class="Stack">helpful</strong> Stack Exchange user.</p>';
And then I naively replace any instance of "Stack" with "Flack", I will get this:
$test = '<p>You are such a <strong class="Flack">helpful</strong> Flack Exchange user.</p>';
Clearly, I did not want this. I only wanted to change the actual "content" -- not the HTML parts. I want this:
$test = '<p>You are such a <strong class="Stack">helpful</strong> Flack Exchange user.</p>';
For that to be possible, there has to be some kind of intelligent parsing going on. Something which first detects and picks out the HTML elements from the string, then makes the string replacement operation on the "pure" content string, and then somehow puts the HTML elements back, intact, in the right places.
My brain has been wrestling with this for quite some time now and I can't find any reasonable solution which wouldn't be hackish and error-prone.
It strikes me that this might exist as a feature built into PHP. Is that the case? Or is there some way I could accomplish this in a robust and sane way?
I would rather not try to replace all HTML parts with ____DO_NOT_TOUCH_1____, ____DO_NOT_TOUCH_2____, etc. It doesn't seem like the right way.

You can do it as suggested by @04FS, with the following recursive function:
function replaceText(DOMNode $node, string $search, string $replace) {
if($node->hasChildNodes()) {
foreach($node->childNodes as $child) {
if ($child->nodeType == XML_TEXT_NODE) {
$child->textContent = str_replace($search, $replace, $child->textContent);
} else {
replaceText($child, $search, $replace);
}
}
}
}
As DOMDocument is a DOMNode, too, you can use it directly as a function argument:
$html =
'<div class="foo">
<span class="foo">foo</span>
<span class="foo">foo</span>
foo
</div>';
$doc = new DOMDocument();
$doc->loadXML($html); // alternatively loadHTML(), will throw an error on invalid HTML tags
replaceText($doc, 'foo', 'bar');
echo $doc->saveXML();
// or
echo $doc->saveXML($doc->firstChild);
// ... to get rid of the leading XML version tag
Will output
<div class="foo">
<span class="foo">bar</span>
<span class="foo">bar</span>
bar
</div>
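The string in the question is an HTML fragment rather than well-formed XML, so a similar idea can be sketched with loadHTML() and an XPath query that selects every text node. This is only a sketch under stated assumptions: the helper name replaceInHtmlText is made up for illustration, the input is assumed to be non-empty and plain ASCII (loadHTML() does not treat input as UTF-8 by default), and text inside script or style elements would be rewritten as well.

```php
<?php
// Hypothetical helper: replace $search with $replace in text nodes only.
function replaceInHtmlText(string $html, string $search, string $replace): string
{
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html); // the parser wraps the fragment in <html><body>
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // '//text()' selects every text node; attributes are never touched.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $textNode) {
        $textNode->data = str_replace($search, $replace, $textNode->data);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$test = '<p>You are such a <strong class="Stack">helpful</strong> Stack Exchange user.</p>';
echo replaceInHtmlText($test, 'Stack', 'Flack');
```

The class="Stack" attribute is left alone because attribute values are not text nodes.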
Bonus: When you want to str_replace an attribute value
function replaceTextInAttribute(DOMNode $node, string $attribute_name, string $search, string $replace) {
if ($node->hasAttributes()) {
foreach ($node->attributes as $attr) {
if($attr->nodeName === $attribute_name) {
$attr->nodeValue = str_replace($search, $replace, $attr->nodeValue);
}
}
}
if($node->hasChildNodes()) {
foreach($node->childNodes as $child) {
replaceTextInAttribute($child, $attribute_name, $search, $replace);
}
}
}
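As a non-recursive alternative for attributes, an XPath query can select the attribute nodes directly. The snippet below is a minimal sketch, not the method used above; the sample markup (including the title attribute) is made up for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<div class="foo"><span class="foo" title="foo">foo</span></div>');

$xpath = new DOMXPath($doc);
// '//@class' selects every class attribute node in the document.
foreach ($xpath->query('//@class') as $attr) {
    $attr->value = str_replace('foo', 'bar', $attr->value);
}

// Only the class attributes change; the title attribute and the text stay as they were.
$result = $doc->saveXML($doc->documentElement);
echo $result;
```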
Bonus 2: Make the function more extensible
function modifyText(DOMNode $node, callable $userFunc) {
if($node->hasChildNodes()) {
foreach($node->childNodes as $child) {
if ($child->nodeType == XML_TEXT_NODE) {
$child->textContent = $userFunc($child->textContent);
} else {
modifyText($child, $userFunc);
}
}
}
}
modifyText(
$doc,
function(string $string) {
return strtoupper(str_replace('foo', 'bar', $string));
}
);
echo $doc->saveXML($doc->firstChild);
Will output
<div class="foo">
<span class="foo">BAR</span>
<span class="foo">BAR</span>
BAR
</div>

Related

XML: why is my DOM traversal function yielding only the top-level node?

I thought I would write a simple function to visit all the nodes in a DOM tree. I wrote it, gave it a not-too-complex bit of XML to work on, but when I ran it I got only the top-level (DOMDocument) node.
Note that I am using PHP's Generator syntax:
http://php.net/manual/en/language.generators.syntax.php
Here's my function:
function DOMIterate($node)
{
yield $node;
if ($node->hasChildNodes())
{
foreach ($node->childNodes as $subnode) {
// if($subnode != null) {
DOMIterate($subnode);
// }
}
}
}
And the testcase code that is supposed to print the results:
$doc = new DOMDocument();
$doc->loadXML($input);
foreach (DOMIterate($doc) as $node) {
$type = $node->nodeType;
if ($type == XML_ELEMENT_NODE) {
$tag = $node-> tagName;
echo "$tag\n";
}
else if ($type == XML_DOCUMENT_NODE) {
echo "document\n";
}
else if ($type == XML_TEXT_NODE) {
$text = $node->wholeText;
echo "text: $text\n";
} else {
$linenum = $node->getLineNo();
echo "unknown node type: $type at input line $linenum\n";
}
}
The input XML is the first 18 lines of
https://www.w3schools.com/xml/plant_catalog.xml
plus a closing
If you're using PHP7, you can try this:
<?php
$string = <<<EOS
<div level="1">
<div level="2">
<p level="3"></p>
<p level="3"></p>
</div>
<div level="2">
<span level="3"></span>
</div>
</div>
EOS;
$document = new DOMDocument();
$document->loadXML($string);
function DOMIterate($node)
{
yield $node;
if ($node->childNodes) {
foreach ($node->childNodes as $childNode) {
yield from DOMIterate($childNode);
}
}
}
foreach (DOMIterate($document) as $node) {
echo $node->nodeName . PHP_EOL;
}
Here's a working example - http://sandbox.onlinephpfunctions.com/code/ab4781870f8f988207da78b20093b00ea2e8023b
Keep in mind that you'll also get the text nodes that are contained within the tags.
Using yield in a function called from the generator doesn't return the value to the caller of the original generator. You need to use yield from to propagate the values back.
function DOMIterate($node)
{
yield $node;
if ($node->hasChildNodes())
{
foreach ($node->childNodes as $subnode) {
// if($subnode != null) {
yield from DOMIterate($subnode);
// }
}
}
}
This requires PHP 7. If you're using an earlier version, see Recursive generators in PHP
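For PHP 5.5 and 5.6, where yield from is not available, the usual workaround is to loop over the recursive generator and re-yield each value. A minimal sketch of that approach follows; the name DOMIterateCompat and the sample XML are made up for illustration.

```php
<?php
// Variant without "yield from": re-yield each value from the recursive call.
function DOMIterateCompat($node)
{
    yield $node;
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $subnode) {
            foreach (DOMIterateCompat($subnode) as $descendant) {
                yield $descendant;
            }
        }
    }
}

$doc = new DOMDocument();
$doc->loadXML('<a><b/><c>text</c></a>');

// Collect the node names in document order.
$names = array();
foreach (DOMIterateCompat($doc) as $node) {
    $names[] = $node->nodeName;
}
echo implode(', ', $names), PHP_EOL;
```

One thing to keep in mind with the yield from version: it preserves the keys of the inner generators, so the keys repeat. When collecting the nodes with iterator_to_array(), pass false as the second argument so that later nodes do not overwrite earlier ones.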

Parse results from Zend_Dom_Query

I am trying to parse screen-scraped data using Zend_Dom_Query, but I am struggling with how to apply it properly in my case, and all the other answers I have seen on SO make assumptions that quite frankly scare me with their naiveté.
A typical example is How to Pass Array from Zend Dom Query Results to table where pairs of data points are being extracted from the documents body through the use of separate calls to the query() method.
$year = $dom->query('.secondaryInfo');
$rating = $dom->query('.ratingColumn');
Where the underlying assumptions are that an equal number of $year and $rating results exist AND that they are correctly aligned with each other within the document. If either of those assumptions is wrong, then the extracted data is less than worthless - in fact it becomes all lies.
In my case I am trying to extract multiple chunks of data from a site, where each chunk is nominally of the form:
<p class="main" atrb1="value1">
<a href="#1" >href text 1</a>
<span class="sub1">
<span class="span1"></span>
<span class="sub2">
<span class="span2">data span2</span>
href text 2
</span>
<span class="sub3">
<span class="span3">
<p>Some other data</p>
<span class="sub4">
<span class="sub5">More data</span>
</span>
</span>
</span>
</span>
</p>
For each chunk, I need to grab data from various sections:
".main"
".main a"
".main .span2"
".main .sub2 a"
".main .span3 p"
etc
And then process the set of data as one distinct unit, and not as multiple collections of different data.
I know I can hard code the selection of each element (and I currently do that), but that produces brittle code reliant on the source data being stable. And this week the data source yet again changed and I was bitten by my hard coded scraping failing to work. Thus I am trying to write robust code that can locate what I want without me having to care/know about the overall structure (Hmmm - Linq for php?)
So in my mind, I want the code to look something like
$dom = new Zend_Dom_Query($body);
$results = $dom->query('.main');
foreach ($results as $result)
{
$data1 = $result->query(".main a");
$data2 = $result->query(".main .span2");
$data3 = $result->query(".main .sub a");
etc
if ($data1 && $data2 && $data3) {
Do something
} else {
Do something else
}
}
Is it possible to do what I want with stock Zend/PHP function calls? Or do I need to write some sort of custom function to implement $result->query()?
OK .. so I bit the bullet and wrote my own solution to the problem. This code recurses through the results from the Zend_Dom_Query and looks for matching css selectors. As presented the code works for me and has also helped clean up my code. Performance wasn't an issue for me, but as always Caveat Emptor. I have also left in some commented out code that enables visualization of where the search is leading. The code was also part of a class, hence the use of $this-> in places.
The code is used as:
$dom = new Zend_Dom_Query($body);
$results = $dom->query('.main');
foreach ($results as $result)
{
$data1 = $this->domQuery($result, ".sub2 a");
if (!is_null($data1))
{
Do Something
}
}
Which finds the href text 2 element under the <span class="sub2"> element.
// Function that recurses through a Zend_Dom_Query_Result, looking for css selectors
private function recurseDomQueryResult($dom, $depth, $targets, $index, $count)
{
// Gross checking
if ($index<0 || $index >= $count) return NULL;
// Document where we are
$element = $dom->nodeName;
$class = NULL;
$id = NULL;
// $href = NULL;
// Skip unwanted elements
if ($element == '#text') return NULL;
if ($dom->hasAttributes()) {
if ($dom->hasAttribute('class'))
{
$class = trim($dom->getAttribute('class'));
}
if ($dom->hasAttribute('id'))
{
$id = trim($dom->getAttribute('id'));
}
// if ($element == 'a')
// {
// if ($dom->hasAttribute('href'))
// {
// $href = trim($dom->getAttribute('href'));
// }
// }
}
// $padding = str_repeat('==', $depth);
// echo "$padding<$element";
// if (!($class === NULL)) echo ' class="'.$class.'"';
// if (!($href === NULL)) echo ' href="'.$href.'"';
// echo '><br />'. "\n";
// See if we have a match for the current element
$target = $targets[$index];
$sliced = substr($target,1);
switch($target[0])
{
case '.':
if ($sliced === $class) {
$index++;
}
break;
case '#':
if ($sliced === $id) {
$index++;
}
break;
default:
if ($target === $element) {
$index++;
}
break;
}
// Check for having matched all
if ($index == $count) return $dom;
// We didn't have a match at this level
// So recursively look at all the children
$children = $dom->childNodes;
if ($children) {
foreach($children as $child)
{
if (!is_null(($result = $this->recurseDomQueryResult($child, $depth+1, $targets, $index, $count)))) return $result;
}
}
// Did not find anything
// echo "$padding</$element><br />\n";
return NULL;
}
// User function that you call to find a single element in a Zend_Dom_Query_Result
// $dom is the Zend_Dom_Query_Result object
// $path is a path of css selectors, e.g. ".sub2 a"
private function domQuery($dom, $path)
{
$depth = 0;
$index = 0;
$targets = explode(' ', $path);
$count = count($targets);
return $this->recurseDomQueryResult($dom, $depth, $targets, $index, $count);
}
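An alternative to walking the tree by hand is a relative XPath query, which PHP's stock DOMXPath supports through the second (context node) argument of query(). The sketch below uses plain DOMDocument and DOMXPath rather than Zend_Dom_Query, and the markup is a made-up, simplified version of the chunks in the question (div containers instead of nested p elements, which an HTML parser would rearrange). Since the code above treats each result as a DOM node, the same kind of query could presumably be run against the nodes that Zend_Dom_Query returns, but that is an assumption.

```php
<?php
// Hypothetical markup: two "chunks", the second one missing its .span2 element.
$body = '<div class="main"><a href="#1">href text 1</a>'
      . '<span class="sub2"><span class="span2">data span2</span></span></div>'
      . '<div class="main"><a href="#2">href text 2</a></div>';

$doc = new DOMDocument();
$doc->loadHTML($body);
$xpath = new DOMXPath($doc);

$rows = array();
foreach ($xpath->query('//div[@class="main"]') as $chunk) {
    // The leading "." makes each query relative to the current chunk.
    $link  = $xpath->query('.//a', $chunk)->item(0);
    $span2 = $xpath->query('.//span[@class="span2"]', $chunk)->item(0);

    // Missing pieces show up as null instead of shifting the other results.
    $rows[] = array(
        'link'  => $link  ? $link->textContent  : null,
        'span2' => $span2 ? $span2->textContent : null,
    );
}
print_r($rows);
```

Each chunk is processed as one unit, so a missing element in one chunk cannot misalign the data of another. Note that @class="main" only matches when the class attribute is exactly "main"; elements with several classes need a contains()-based test.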

How to get plain text inside body tag using dom..and get the words into an array?

I want to get the contents inside the body tag, separate them into words, and get the words into an array. I am using PHP.
This is what I have done:
$content=file_get_contents($_REQUEST['url']);
$content=html_entity_decode($content);
$content = preg_replace("/&#?[a-z0-9]+;/i"," ",$content);
$dom = new DOMDocument;
@$dom->loadHTML($content);
$tags=$dom->getElementsByTagName('body');
foreach($tags as $h)
{
echo "<li>".$h->tagName;
getChilds2($h);
function getChilds2($node)
{
if($node->hasChildNodes())
{
foreach($node->childNodes as $c)
{
if($c->nodeType==3)
{
$nodeValue=$c->nodeValue;
$words=feature_node($c,$nodeValue,true);
if($words!=false)
{
$_ENV["words"][]=$words;
}
else if($c->tagName!="")
{
getChilds2($c);
}
}
}
}
else
{
return;
}
}
function feature_node($node,$content,$display)
{
if(strlen($content)<=0)
{
return;
}
$content=strtolower($content);
$content=mb_convert_encoding($content, 'UTF-8',
mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
$content= drop_script_tags($content);
$temp=$content;
$content=strip_punctuation($content);
$content=strip_symbols($content);
$content=strip_numbers($content);
$words_after_noise_removal=mb_split( ' +',$content);
$words_after_stop_words_removal=remove_stop_words($words_after_noise_removal);
if(count($words_after_stop_words_removal)==0)
return(false);
$i=0;
foreach($words_after_stop_words_removal as $w)
{
$words['word'][$i]=$w;
$i++;
}
for($i=0;$i<sizeof($words['word']);$i++)
{
$words['stemmed'][$i]= PorterStemmer::Stem($words['word'][$i],true)."<br/>";
}
return($words);
}
Here I have used some functions like strip_punctuation, strip_symbols, strip_numbers, remove_stop_words and PorterStemmer for preprocessing the page. They are working fine, but I am not getting the contents into the array, and print_r() or echo gives nothing. Any help, please?
You don't have to iterate over the nodes.
$tags = $dom->getElementsByTagName('body');
will give you just one result in the DOMNodeList. So all you need to do to get the text is
$plainText = $tags->item(0)->nodeValue;
or
$plainText = $tags->item(0)->textContent;
To get the separate words into an array, you can then use str_word_count ("Return information about words used in a string") on the resulting $plainText.
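Put together, the two steps might look like the sketch below. The sample HTML is made up for illustration. Passing 1 as the second argument makes str_word_count() return the words themselves instead of a count. Keep in mind that str_word_count() is locale dependent and not multibyte aware, so this assumes plain ASCII text.

```php
<?php
// Hypothetical sample page.
$html = '<html><body><p>Hello <b>big</b> world, this is a test.</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// textContent of <body> is the concatenated text of all its descendants.
$plainText = $dom->getElementsByTagName('body')->item(0)->textContent;

// Format 1: return an array containing every word found in the string.
$words = str_word_count($plainText, 1);
print_r($words);
```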

(PHP) Regex for finding specific href tag

I have an HTML document with n "a href" tags with different target urls and different text between the tags.
For example:
<span ....>lorem ipsum</span>
<span ....>example</span>
example3
<img ...>test</img>
without a d as target url
As you can see the target urls switch between "d?, d., d/d?, d/d." and between the "a tag" there could be any type of html which is allowed by w3c.
I need a regex which gives me all links that have one of these combinations in the target url:
"d?, d., d/d?, d/d." and have "Lorem" or "test" between the "a tags" in any position, including inside nested HTML tags.
My Regex so far:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>.*?</a>)
I tried to include the lorem / test as follows:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>(lorem|test)+</a>)
but this only works if I put a ".*?" before and after the (lorem|test), and that would be too greedy.
If there is an easier way with SimpleXML or any other DOM parser, please let me know. Otherwise I would appreciate any help with the regex.
Thanks!
Here you go:
$html = array
(
'<span ....>lorem ipsum</span>',
'<span ....>example</span>',
'example3',
'<img ...>test</img>',
'without a d as target url',
);
$html = implode("\n", $html);
$result = array();
$anchors = phXML($html, '//a[contains(., "lorem") or contains(., "test")]');
foreach ($anchors as $anchor)
{
if (preg_match('~d[.?]~', strval($anchor['href'])) > 0)
{
$result[] = strval($anchor['href']);
}
}
echo '<pre>';
print_r($result);
echo '</pre>';
Output:
Array
(
[0] => http://www.example.com/d?12345abc
[1] => http://www.example.com/d/d.1234
)
The phXML() function is based on my DOMDocument / SimpleXML wrapper, and goes as follows:
function phXML($xml, $xpath = null)
{
if (extension_loaded('libxml') === true)
{
libxml_use_internal_errors(true);
if ((extension_loaded('dom') === true) && (extension_loaded('SimpleXML') === true))
{
if (is_string($xml) === true)
{
$dom = new DOMDocument();
if (@$dom->loadHTML($xml) === true)
{
return phXML(@simplexml_import_dom($dom), $xpath);
}
}
else if ((is_object($xml) === true) && (strcmp('SimpleXMLElement', get_class($xml)) === 0))
{
if (isset($xpath) === true)
{
$xml = $xml->xpath($xpath);
}
return $xml;
}
}
}
return false;
}
I'm too lazy not to use this function right now, but I'm sure you can get rid of it if you need to.
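For readers who would rather not depend on the wrapper, the same idea can be sketched with DOMDocument and DOMXPath directly. The anchors in the question were lost, so the sample links below (URLs and link texts) are made up for illustration; the XPath expression and the regular expression are the ones used above.

```php
<?php
// Hypothetical sample links; the URLs and link texts are made up.
$html = '<a href="http://www.example.com/d?12345abc"><span>lorem ipsum</span></a>'
      . '<a href="http://www.example.com/d?999"><span>example</span></a>'
      . '<a href="http://www.example.com/d/d.1234"><b>test</b></a>'
      . '<a href="http://www.example.com/x/1">test without a d</a>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = array();
// contains(., "...") looks at the full text of each <a>, including nested tags.
foreach ($xpath->query('//a[contains(., "lorem") or contains(., "test")]') as $anchor) {
    $href = $anchor->getAttribute('href');
    // Keep only targets that contain "d." or "d?".
    if (preg_match('~d[.?]~', $href) === 1) {
        $result[] = $href;
    }
}
print_r($result);
```

XPath's contains() is case sensitive, so "Lorem" with a capital L would not match this expression as written.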
Here is a Regular Expression which works:
$search = '/<a\s[^>]*href=["\'](?:http:\/\/)?(?:[a-z0-9-]+(?:\.[a-z0-9-]+)*)\/(?:d\/)?d[?.].*?>.*?(?:lorem|test)+.*?<\/a>/i';
$matches = array();
preg_match_all($search, $html, $matches);
The only thing is it relies on there being a new-line character between each <a> tag. Otherwise it will match something like:
example3<img ...>test</img>
Use an HTML parser. There are lots of reasons that Regex is absolutely not the solution for parsing HTML.
There's a good list of them here:
Robust and Mature HTML Parser for PHP
This will print only the first and fourth links, because they are the only ones that meet both conditions.
preg_match_all('#href="(.*?)"(.*?)>(.*?)</a>#is', $string, $matches);
$count = count($matches[0]);
unset($matches[0], $matches[2]);
for($i = 0; $i < $count; $i++){
if(
strpos($matches[1][$i], '/d') !== false
&&
preg_match('#(lorem|test)#is', $matches[3][$i]) == true
)
{
echo $matches[1][$i];
}
}

PHP: Display the first 500 characters of HTML

I have a huge HTML code in a PHP variable like :
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
I want to display only the first 500 characters of this code. The character count must consider only the text inside the HTML tags and should exclude the HTML tags and attributes themselves when measuring the length.
But while trimming the code, it should not affect the DOM structure of the HTML.
Is there any tutorial or working example available?
If it's only the text you want, you can do this with the following too
substr(strip_tags($html_code),0,500);
Ooohh... I know this. I can't get it exactly right off the top of my head, but you want to load the text you've got as a DOMDocument
http://www.php.net/manual/en/class.domdocument.php
then grab the text from the entire document node (as a DOMNode, http://www.php.net/manual/en/class.domnode.php)
This won't be exactly right, but hopefully this will steer you onto the right track.
Try something like:
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$dom = new DOMDocument();
$dom->loadHTML($html_code);
$text_to_strip = $dom->textContent;
$stripped = mb_substr($text_to_strip,0,500);
echo "$stripped"; // The Sameple text.Another sample text.....
edit ok... that should work. just tested locally
edit2
Now that I understand you want to keep the tags, but limit the text, lets see. You're going to want to loop the content until you get to 500 characters. This is probably going to take a few edits and passes for me to get right, but hopefully I can help. (sorry I can't give undivided attention)
First case is when the text is less than 500 characters. Nothing to worry about. Starting with the above code we can do the following.
// compare against the full text: $stripped is already cut to 500 characters
if (mb_strlen($text_to_strip) > 500) {
// this is where we do our work.
$characters_so_far = 0;
// copy the live node list first, because nodes are removed inside the loop
foreach (iterator_to_array($dom->childNodes) as $ChildNode) {
// should check if $ChildNode->hasChildNodes();
// probably put some of this stuff into a function
$characters_in_next_node = mb_strlen((string) $ChildNode->textContent);
if ($characters_so_far + $characters_in_next_node > 500) {
// this node would take us past the limit, so remove it
$ChildNode->parentNode->removeChild($ChildNode);
}
$characters_so_far += $characters_in_next_node;
}
$final_out = $dom->saveHTML();
} else {
$final_out = $html_code;
}
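The idea of counting characters node by node can be taken one step further by recursing into the tree and cutting the text node in which the limit is reached. The following is a rough sketch under stated assumptions, not a drop-in solution: the function names truncateTextNodes and truncateHtml are made up, the sample markup is invented, and the input is assumed to be non-empty ASCII (loadHTML() does not treat input as UTF-8 by default).

```php
<?php
// Hypothetical helper: keep at most $remaining characters of text below $node.
function truncateTextNodes(DOMNode $node, int &$remaining): void
{
    // Copy the live child list first, because nodes are removed inside the loop.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);            // budget spent: drop the rest
        } elseif ($child instanceof DOMText) {
            $length = mb_strlen($child->data);
            if ($length > $remaining) {
                $child->data = mb_substr($child->data, 0, $remaining);
            }
            $remaining -= min($length, $remaining);
        } elseif ($child->hasChildNodes()) {
            truncateTextNodes($child, $remaining); // descend into elements
        }
    }
}

// Hypothetical wrapper: parse a fragment, truncate it, return the fragment again.
function truncateHtml(string $html, int $limit): string
{
    $previous = libxml_use_internal_errors(true);  // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    truncateTextNodes($body, $limit);

    // Serialize only the children of <body>, so the tags stay balanced.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo truncateHtml('<div class="a">Hello <b>big</b> world</div><p>more</p>', 8);
```

Because the cut happens inside the DOM and the result is serialized afterwards, every element that survives is closed properly; only text and trailing nodes are dropped.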
I'm pasting below a PHP class I wrote a long time ago, but I know it works. It's not exactly what you're after, as it deals with words instead of a character count, but I figure it's pretty close and someone might find it useful.
class HtmlWordManipulator
{
var $stack = array();
function truncate($text, $num=50)
{
if (preg_match_all('/\s+/', $text, $junk) <= $num) return $text;
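// the callback is a method of this class, so it has to be passed as array($this, ...)
$text = preg_replace_callback('/(<\/?[^>]+\s+[^>]*>)/', array($this, '_truncateProtect'), $text);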
$words = 0;
$out = array();
$text = str_replace('<',' <',str_replace('>','> ',$text));
$toks = preg_split('/\s+/', $text);
foreach ($toks as $tok)
{
if (preg_match_all('/<(\/?[^\x01>]+)([^>]*)>/',$tok,$matches,PREG_SET_ORDER))
foreach ($matches as $tag) $this->_recordTag($tag[1], $tag[2]);
$out[] = trim($tok);
if (! preg_match('/^(<[^>]+>)+$/', $tok))
{
if (!strpos($tok,'=') && !strpos($tok,'<') && strlen(trim(strip_tags($tok))) > 0)
{
++$words;
}
else
{
/*
echo '<hr />';
echo htmlentities('failed: '.$tok).'<br /)>';
echo htmlentities('has equals: '.strpos($tok,'=')).'<br />';
echo htmlentities('has greater than: '.strpos($tok,'<')).'<br />';
echo htmlentities('strip tags: '.strip_tags($tok)).'<br />';
echo str_word_count($text);
*/
}
}
if ($words > $num) break;
}
$truncate = $this->_truncateRestore(implode(' ', $out));
return $truncate;
}
function restoreTags($text)
{
foreach ($this->stack as $tag) $text .= "</$tag>";
return $text;
}
private function _truncateProtect($match)
{
return preg_replace('/\s/', "\x01", $match[0]);
}
private function _truncateRestore($strings)
{
return preg_replace('/\x01/', ' ', $strings);
}
private function _recordTag($tag, $args)
{
// XHTML
if (strlen($args) and $args[strlen($args) - 1] == '/') return;
else if ($tag[0] == '/')
{
$tag = substr($tag, 1);
for ($i=count($this->stack) -1; $i >= 0; $i--) {
if ($this->stack[$i] == $tag) {
array_splice($this->stack, $i, 1);
return;
}
}
return;
}
else if (in_array($tag, array('p', 'li', 'ul', 'ol', 'div', 'span', 'a')))
$this->stack[] = $tag;
else return;
}
}
truncate is what you want, and you pass it the html and the number of words you want it trimmed down to. it ignores html while counting words, but then rewraps everything in html, even closing trailing tags due to the truncation.
please don't judge me on the complete lack of oop principles. i was young and stupid.
edit:
so it turns out the usage is more like this:
$content = $manipulator->restoreTags($manipulator->truncate($myHtml,$numOfWords));
stupid design decision. allowed me to inject html inside the unclosed tags though.
I'm not up to coding a complete solution, but if someone wants to, here's where I'd start: collect the text that sits directly inside each element.
$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$aggregate = '';
$document = new DOMDocument();
$document->loadHTML($html_code);
foreach ($document->getElementsByTagName('*') as $element) {
foreach ($element->childNodes as $child) {
if ($child->nodeType === XML_TEXT_NODE) {
$aggregate .= $child->nodeValue; // This is the text, not HTML. It doesn't
// include the children, only the text
// directly in the tag.
}
}
}
