PHP DOMDocument does not keep numeric presentation of HTML special characters

I have a DOMDocument and would like to append some nodes.
In one of the nodes, I would like to put:
$copyrightStatementText = "&#169; This is the CopyRight";
The problem is that the function:
$copyrightStatement = $dom_output->createElement('copyright-statement', $copyrightStatementText);
is converting the &#169; immediately to ©.
My goal is to keep the &#169;.
Any idea how I could do that?

From DOMDocument::createElement():
Note:
The value will not be escaped. Use DOMDocument::createTextNode() to create a text node with escaping support.
So use DOMDocument::createTextNode() instead:
$copyrightString = "&#169; This is the Copyright";
$copyrightNode = $dom_output->createTextNode($copyrightString);
$copyrightContainer = $dom_output->createElement('copyright-statement');
$copyrightContainer->appendChild($copyrightNode);
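To see the difference between the two calls, here is a minimal sketch. It assumes a fresh document named $dom_output, as in the question, and the variable names $parsed and $literal are only illustrative. Passing the value to createElement() lets the parser resolve the character reference, while a text node stores the string literally and escapes the ampersand on output. Whether the escaped form suits you depends on what consumes the XML.

```php
<?php
$dom_output = new DOMDocument('1.0', 'UTF-8');

// Value passed to createElement(): the character reference is parsed,
// so the element ends up holding the single character U+00A9.
$parsed = $dom_output->createElement('copyright-statement', '&#169; This is the Copyright');

// Text node: the string is stored as-is, and "&" is escaped when serialized.
$literal = $dom_output->createElement('copyright-statement');
$literal->appendChild($dom_output->createTextNode('&#169; This is the Copyright'));

echo $parsed->textContent, "\n";           // text starts with the © character
echo $dom_output->saveXML($literal), "\n"; // text starts with &amp;#169;
```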

Related

How to handle special HTML characters in DOMDocument?

Let's say I build an HTML fragment using the following code:
$dom = new DOMDocument();
$header = $dom->createElement("h2", "Lorem & Ipsum");
$dom->appendChild($header);
print($dom->saveHTML());
The raw HTML printed contains the unescaped & symbol instead of the necessary HTML entity &amp;. The code also triggers the following PHP warning:
Warning: DOMDocument::createElement(): unterminated entity reference
What's the best way to handle this?

It appears that the PHP team is not willing to change this behavior (source), so we have to find a workaround instead.
One way is to simply do the encoding yourself in the PHP code, like so:
$header = $dom->createElement("h2", "Lorem &amp; Ipsum");
However, this isn't always convenient, as the text may be held in a variable or contain other special characters besides &. In that case, you can use the htmlentities function:
$text = "Lorem & Ipsum";
$header = $dom->createElement("h2", htmlentities($text));
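One caveat with htmlentities is that it also turns characters such as é into named HTML entities. A variation worth considering, sketched here under the assumption that only the XML-special characters need escaping, is htmlspecialchars with the ENT_XML1 flag. The flag choice is my assumption, not something from the original answer.

```php
<?php
$dom = new DOMDocument();
$text = "Lorem & Ipsum";

// Escape only the characters that are special in XML before handing the
// value to createElement(), which parses entities in its second argument.
$header = $dom->createElement("h2", htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8'));
$dom->appendChild($header);

echo $header->textContent, "\n"; // the stored text is the original string
echo $dom->saveHTML();           // the ampersand is serialized as &amp;
```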
If this still is not an ideal solution, another workaround is to use the textContent property instead of the second argument in createElement.
In the code below, I've implemented this in a DOMDocument subclass, so you just have to use the BetterDOM subclass instead to fix this strange bug.
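class BetterDOM extends DOMDocument {
    // Create the element without a value, then assign the text through
    // textContent so that it is stored literally instead of being parsed.
    public function createElement($tag, $text = null) {
        $base = parent::createElement($tag);
        $base->textContent = $text;
        return $base;
    }
}
// Correctly prints "<h2>Lorem &amp; Ipsum</h2>" with no errors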
$dom = new BetterDOM();
$header = $dom->createElement("h2", "Lorem & Ipsum");
$dom->appendChild($header);
print($dom->saveHTML());

get wrapping element using preg_match php

I want a preg_match code that will detect a given string and get its wrapping element.
I have a string and a html code like:
$string = "My text";
$html = "<div><p class='text'>My text</p><span>My text</span></div>";
So I need to create a function that will return the element wrapping the string, like:
$element = get_wrapper($string, $html);
function get_wrapper($str, $code){
    // code here that uses preg_match and returns the wrapper element
}
The returned value will be an array, since there are 2 possible matches: <p class='text'></p> and <span></span>.
Can anyone give me a regex pattern to get the HTML element that wraps the given string?
Thanks! Answers are greatly appreciated.

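Using regex for this task is a bad idea. You can use DOMDocument instead:
$sHtml = "<div><p class='text'>My text</p><span>My text</span></div>";
$oDom = new DOMDocument('1.0', 'UTF-8');
$oDom->loadXML("<div>" . $sHtml . "</div>");
$aWrappers = get_wrapper("My text", $oDom);
and then walk the tree recursively:
function get_wrapper($s, $oNode, &$aFound = array()) {
    foreach ($oNode->childNodes as $oItem) {
        if (!($oItem instanceof DOMElement)) {
            continue; // only elements can wrap the string
        }
        if ($oItem->nodeValue == $s) {
            // needed tag - $oItem->nodeName
            $aFound[] = $oItem;
        }
        else {
            get_wrapper($s, $oItem, $aFound);
        }
    }
    return $aFound;
}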
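As an alternative to walking the tree by hand, the same lookup can be sketched with DOMXPath. The helper name find_wrappers is hypothetical, and the sketch assumes the search string contains no double quote, since that would break the XPath string literal.

```php
<?php
// Hypothetical helper: returns the elements that have a text node equal to $str.
function find_wrappers(string $str, string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $wrappers = [];
    // Assumption: $str contains no double quote character.
    foreach ($xpath->query('//*[text()="' . $str . '"]') as $element) {
        $wrappers[] = $element;
    }
    return $wrappers;
}

$html = "<div><p class='text'>My text</p><span>My text</span></div>";
foreach (find_wrappers("My text", $html) as $element) {
    echo $element->nodeName, "\n"; // p, then span
}
```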

The simple pattern below would work, but it assumes a lot. Regexes shouldn't be used for this; you should look at something like the Simple HTML DOM parser, which is more intelligent.
Anyway, a regex that matches the string together with its wrapping tags is as follows.
/[A-Za-z'= <]*>My text<[A-Za-z\/>]*/g

Even if regex is never the correct answer in the domain of DOM parsing, I came up with another (quite simple) solution:
<[^>/]+?>My String</.+?>
This works if the HTML is well formed (i.e. it has closing tags, < is replaced with &lt;, and so on). Wrap each tag pattern in parentheses and you get the opening tag in the first regex group and the closing tag in the second.
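For illustration, here is how that pattern could be used from PHP with both tag patterns placed in capture groups. This is only a sketch on the sample markup from the question and shares the pattern's assumptions about well-formed HTML. The use of preg_quote and the # delimiter are my choices.

```php
<?php
$string = "My text";
$html = "<div><p class='text'>My text</p><span>My text</span></div>";

// Group 1 captures the opening tag, group 2 the closing tag.
$pattern = '#(<[^>/]+?>)' . preg_quote($string, '#') . '(</.+?>)#';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match[1], $match[2], "\n"; // <p class='text'></p> then <span></span>
}
```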

How to Ignore Whitespaces using preg_match()

I have a string that looks like:
">ANY CONTENT</span>(<a id="show
I need to fetch ANY CONTENT. However, there are spaces in between
</span> and (<a id="show
Here is my preg_match:
$success = preg_match('#">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
\s* represents spaces. I get an empty array!
Any idea how to fetch CONTENT?

Use a real HTML parser. Regular expressions are not really suitable for the job. See this answer for more detail.
You can use DOMDocument::loadHTML() to parse into a structured DOM object that you can then query, like this very basic example (you need to do error checking though):
$dom = new DOMDocument;
$dom->loadHTML($data);
$span = $dom->getElementsByTagName('span');
$content = $span->item(0)->textContent;
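The error checking mentioned above could look roughly like this. The sample markup in $data is an assumption modelled on the string in the question. The sketch collects libxml warnings instead of printing them and guards against a missing span.

```php
<?php
// Assumed sample input, shaped like the fragment in the question.
$data = '<span class="x">ANY CONTENT</span>   (<a id="show">more</a>)';

libxml_use_internal_errors(true); // collect parse warnings instead of emitting them
$dom = new DOMDocument;
$content = null;

if ($dom->loadHTML($data)) {
    $span = $dom->getElementsByTagName('span')->item(0);
    if ($span !== null) {
        $content = $span->textContent;
    }
}
libxml_clear_errors();

echo $content === null ? "span not found\n" : $content . "\n";
```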

I just had to:
">
define the above more precisely, because there were too many "> in the page, so the pattern didn't know which one to choose specifically. Therefore, it returned everything from the first "> until it hit (
Solution:
.">
Sample:
$success = preg_match('#\.">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
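The effect can be reproduced with a small sketch. The page fragment below is made up for illustration: it contains an earlier "> before the span, and the span's opening tag ends in ."> so that the stricter pattern has something to anchor on.

```php
<?php
// Made-up page fragment with an earlier "> before the span we want.
$basicPage = '<a href="x">link</a><span title="t.">ANY CONTENT</span>   (<a id="show';

// Loose pattern: starts at the first "> and lazily expands up to the span end.
preg_match('#">(.*?)</span>\s*\(<a id="show#s', $basicPage, $loose);

// Stricter pattern: anchors on ."> so it starts right before the content.
preg_match('#\.">(.*?)</span>\s*\(<a id="show#s', $basicPage, $strict);

echo $loose[1], "\n";  // link</a><span title="t.">ANY CONTENT
echo $strict[1], "\n"; // ANY CONTENT
```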

Ignore html tags in preg_replace

How do I ignore html tags in this preg_replace.
I have a foreach function for a search, so if someone searches for "apple span" the preg_replace also applies a span to the span and the html breaks:
preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);
Thanks in advance!

I suggest you base your function on DOMDocument and DOMXPath rather than on regular expressions. Even though those are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with regular expressions.
The general saying is: Don't parse HTML with regular expressions.
It's a good rule to keep in mind and, although like any rule it does not always apply, it's worth making up one's own mind about it.
XPath allows you to find all texts that contain the search terms within texts only, ignoring all XML elements.
Then you only need to wrap those texts into the <span> and you're done.
Edit: Finally some code ;)
First it makes use of XPath to locate elements that contain the search text. My query looks like this; it might be written better, I'm not a super XPath pro:
'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'
$search contains the text to search for, not containing any " (quote) character (this would break it, see Cleaning/sanitizing xpath attributes for a workaround if you need quotes).
This query will return all parents that contain textnodes which, put together, form a string that contains your search term.
As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It is useful for doing string operations on a list of textnodes as if they were one string.
This is the base skeleton of the routine:
$str = '...'; # some XML
$search = 'text that span';
printf("Searching for: (%d) '%s'\n", strlen($search), $search);
$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);
$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
    throw new Exception('Anchor element not found.');
}
// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
    throw new Exception('XPath failed.');
}
// process search results
foreach($r as $i => $node)
{
    $textNodes = $xp->query('.//child::text()', $node);
    // extract $search textnode ranges, create fitting nodes if necessary
    $range = new TextRange($textNodes);
    $ranges = array();
    while(FALSE !== $start = strpos($range, $search))
    {
        $base = $range->split($start);
        $range = $base->split(strlen($search));
        $ranges[] = $base;
    };
    // wrap each matching textnode
    foreach($ranges as $range)
    {
        foreach($range->getNodes() as $node)
        {
            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $node = $node->parentNode->replaceChild($span, $node);
            $span->appendChild($node);
        }
    }
}
For my example XML:
<html>
<body>
This is some <span>text</span> that span across a page to search in.
and more text that span</body>
</html>
It produces the following result:
<html>
<body>
This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
and more <span class="search_hightlight">text that span</span></body>
</html>
This shows that it even finds text that is distributed across multiple tags. That is not easily possible with regular expressions.
You can find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class that I have left out of the answer's example).
It's not working properly on the viper codepad because of the older LIBXML version that site is using. It works fine with my LIBXML version 20707. I created a related question about this issue: XPath query result order.
A note of warning: This example uses binary string search (strpos) and the related offsets for splitting textnodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.
The example works anyway because it's only making use of US-ASCII which has the same offsets as UTF-8 for the example-data.
For a real life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:
while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))
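For the simpler case where each match lies inside a single text node, the idea can be sketched without the TextRange class. The helper name highlight_text_nodes is hypothetical, and the sketch assumes the input is a well-formed XML fragment, that the mbstring extension is available, and that matches spanning several tags do not need to be found. Because it only touches text nodes, a keyword such as "span" never matches tag names.

```php
<?php
// Hypothetical helper: wraps every occurrence of $keyword that lies inside a
// single text node in <span class="search_hightlight">.
function highlight_text_nodes(string $html, string $keyword): string
{
    if ($keyword === '') {
        return $html; // nothing to search for
    }
    $doc = new DOMDocument();
    // Assumption: $html is a well-formed fragment; wrap it in a temporary root.
    $doc->loadXML('<root>' . $html . '</root>');
    $xp = new DOMXPath($doc);

    // Copy the text nodes first, because the loop below modifies the tree.
    $textNodes = iterator_to_array($xp->query('//text()'));

    foreach ($textNodes as $node) {
        // mb_stripos returns a character offset, which splitText() expects.
        while (false !== ($pos = mb_stripos($node->nodeValue, $keyword, 0, 'UTF-8'))) {
            $match = $node->splitText($pos);                          // starts at the keyword
            $rest  = $match->splitText(mb_strlen($keyword, 'UTF-8')); // text after the keyword
            $span  = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $match->parentNode->replaceChild($span, $match);
            $span->appendChild($match);
            $node = $rest; // keep searching in the remainder
        }
    }

    // Serialize the children of the temporary root.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

echo highlight_text_nodes('An <span>apple</span> and a span of apples', 'span'), "\n";
```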

Extract node from XML like data without extra PHP libs

I am returned the following:
<links>
<image_link>http://img357.imageshack.us/img357/9606/48444016.jpg</image_link>
<thumb_link>http://img357.imageshack.us/img357/9606/48444016.th.jpg</thumb_link>
<ad_link>http://img357.imageshack.us/my.php?image=48444016.jpg</ad_link>
<thumb_exists>yes</thumb_exists>
<total_raters>0</total_raters>
<ave_rating>0.0</ave_rating>
<image_location>img357/9606/48444016.jpg</image_location>
<thumb_location>img357/9606/48444016.th.jpg</thumb_location>
<server>img357</server>
<image_name>48444016.jpg</image_name>
<done_page>http://img357.imageshack.us/content.php?page=done&l=img357/9606/48444016.jpg</done_page>
<resolution>800x600</resolution>
<filesize>38477</filesize>
<image_class>r</image_class>
</links>
I wish to extract the image_link in PHP as simply and as easily as possible. How can I do this?
Assume, I can not make use of any extra libs/plugins for PHP. :)
Thanks all

In Josh's answer, the problem was that the "/" character was not escaped. PHP also has no g modifier, and the captured link is in $regs[1] rather than $regs[0]. So the code Josh submitted would become:
$text = 'string_input';
preg_match('/<image_link>([^<]+)<\/image_link>/i', $text, $regs);
$result = $regs[1];
Taking usoban's answer, an example would be:
<?php
// Load the file into $content
$xml = new SimpleXMLElement($content) or die('Error creating a SimpleXML instance');
$imagelink = (string) $xml->image_link; // This is the image link
?>
I recommend using SimpleXML because it's very easy and, as usoban said, it's built in, which means it doesn't need any external libraries.
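A slightly more defensive variant, sketched on a shortened excerpt of the response from the question, uses simplexml_load_string so that a parse failure can be checked explicitly. The excerpt and the variable names are illustrative.

```php
<?php
// Shortened excerpt of the response shown in the question.
$content = <<<XML
<links>
    <image_link>http://img357.imageshack.us/img357/9606/48444016.jpg</image_link>
    <thumb_exists>yes</thumb_exists>
</links>
XML;

libxml_use_internal_errors(true); // report parse problems via the return value
$xml = simplexml_load_string($content);
if ($xml === false) {
    exit("Could not parse the XML response\n");
}

$imageLink = (string) $xml->image_link; // text of the <image_link> element
echo $imageLink, "\n";
```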

You can use SimpleXML, as it is built into PHP.

use regular expressions
$text = 'string_input';
preg_match('/<image_link>([^<]+)</image_link>/gi', $text, $regs);
$result = $regs[0];
