How to stop PHP Domdocument::SaveXML from inserting "CDATA"?

I'm using PHP to get all the "script" tags from web pages, and then appending text after each </script> that is not always valid HTML. Because it's not always valid markup, I can't just use appendChild/replaceChild to add that information, unless I'm misunderstanding how replaceChild works.
Anyway, when I do
$script_tags = $doc->getElementsByTagName('script');
$l = $script_tags->length;
for ($i = $l - 1; $i > -1; $i--) {
    $script_tags_string = $doc->saveXML($script_tags->item($i));
}
this puts "<![CDATA[" and "]]>" around the contents of the script tag. How can I disable this? Please don't tell me to just delete it afterwards; that's what I'm going to do if I can't find a solution for this.

I have a suspicion that the CDATA is inserted because it would otherwise be invalid XML.
Have you tried using saveHTML instead of saveXML?
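To illustrate the saveHTML suggestion, here is a minimal sketch. The sample markup and variable names are made up for the example, and passing a node to saveHTML() assumes PHP 5.3.6 or later. It serializes a single script node as HTML rather than XML, so the script body should come back without a CDATA wrapper:

```php
<?php
// Sample markup (an assumption); in practice this would be the fetched page.
$html = '<html><head><script>if (a < b) { run(); }</script></head><body><p>Hi</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$script = $doc->getElementsByTagName('script')->item(0);

// Serialize only this node, using the HTML serializer instead of the XML one.
$scriptAsHtml = $doc->saveHTML($script);

echo $scriptAsHtml, PHP_EOL;
```

Whether this fits depends on whether you need XML-valid output; HTML serialization leaves "<" inside scripts unescaped.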

One way I've found to fix this:
Before outputting the document, loop over all script tags and use str_replace to swap "<" and ">" for placeholder strings. Make sure those placeholder strings do not occur anywhere else in the document.
Then store the result of saveXML() in a variable, and finally use str_replace to turn the placeholders back into "<" and ">".
Here is the code:
<?php
//First loop
foreach($dom->getElementsByTagName('script') as $script){
$script->nodeValue = str_replace("<", "ESCAPE_CHAR_LT", $script->nodeValue);
$script->nodeValue = str_replace(">", "ESCAPE_CHAR_GT", $script->nodeValue);
}
//Obtaining XHTML
$output = $dom->saveXML();
//Second replace
$output = str_replace("ESCAPE_CHAR_LT", "<", $output);
$output = str_replace("ESCAPE_CHAR_GT", ">", $output);
//Print document
echo $output;
?>
As you can see, now you are free to use "<" ">" in your scripts.
Hope this helps someone.

Related

How do you remove the new line code (&#10;) with php?

Ok, so I was tasked with creating a gallery. Using a SQL table is not an option, so I am doing what I can. This is my code, which works fine, but it generates a hidden character at the end of every image.
<?php
$photos = file("/elements/photos.php");
for ($i = 0; $i < count($photos); $i++) {
$allimages .= $imagefile = '<img src="/elements/photos/'.$photos[$i].'">';};
?>
<?=$allimages?>
This is the code that it generates
<img src="/elements/photos/t/a_little_kitten.jpg
">
I have been unable to find what &#10; means. I believe it means "blank space" or "new line", but I cannot find it.
This is the code I have tried, but it does not work either.
$allimages = preg_replace('/\s\s+/', ' ', $allimages);
Please help.
Below is the php file I am pulling the image names from. There is no code in this file, just text.
a_little_kitten.jpg
black_cat.jpg
basket.jpg

Try rtrim or trim to remove the whitespace. As far as I can see, there is whitespace at the end of each line of your file, for example after a_little_kitten.jpg and black_cat.jpg.
&#10; represents a line feed. You could also use str_replace(),
e.g. $photos = str_replace(array("\n", "\r"), '', $photos); before the for loop.
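Another option is to never read the newlines in the first place: file() accepts the FILE_IGNORE_NEW_LINES and FILE_SKIP_EMPTY_LINES flags. A minimal sketch, using a temporary file as a stand-in for the photo list (the temp file and variable names are assumptions for the example):

```php
<?php
// Hypothetical stand-in for /elements/photos.php: one file name per line.
$listFile = tempnam(sys_get_temp_dir(), 'photos');
file_put_contents($listFile, "a_little_kitten.jpg\nblack_cat.jpg\nbasket.jpg\n");

// FILE_IGNORE_NEW_LINES drops the newline at the end of each element,
// FILE_SKIP_EMPTY_LINES drops blank lines.
$photos = file($listFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$allimages = '';
foreach ($photos as $photo) {
    // trim() also removes a stray "\r" if the list was saved with Windows line endings.
    $allimages .= '<img src="/elements/photos/' . htmlspecialchars(trim($photo)) . '">';
}

unlink($listFile);
echo $allimages;
```

Initializing $allimages before the loop also avoids the undefined-variable notice that .= on an unset variable would raise.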

unable to understand how to match all characters except a given sequence with preg_replace() in php

So what I am trying to do is to match a regular expression which has an opening &lt;p&gt; tag and a closing &lt;/p&gt; tag. This is the code I wrote:
<?php
$input = "<p&gtjust some text</p&gt more text!";
$input = preg_replace('/<p&gt[^(<\/p&gt)]+?&lt\/;p&gt/','<p>$1</p>',$tem);
echo $input;
?>
So the code does not seem to replace &lt;p&gt with <p> or &lt;/p&gt with </p>. I think the problem is in the part where I am checking for all characters except &lt;/p&gt. I don't think [^(<\/p&gt)] is grouping the characters correctly: I think it checks whether any one of those characters is absent, not whether the entire sequence is absent. Please help me out here.

[] in a regex is a character class; you cannot match strings this way, only single characters or Unicode code points.
If you have escaped HTML entities, you can use htmlspecialchars_decode() to convert them back into characters.
Once you have valid HTML, you can use the DOM to parse, traverse and manipulate it.
See also: How do you parse and process HTML/XML in PHP?
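A rough sketch of that two-step approach, assuming the input uses properly terminated entities (the sample string and variable names are made up for illustration):

```php
<?php
// Escaped input with well-formed entities (an assumed sample string).
$escaped = '&lt;p&gt;just some text&lt;/p&gt; more text!';

// Step 1: turn the entities back into real markup.
$html = htmlspecialchars_decode($escaped);

// Step 2: let the DOM find the paragraph instead of a regex.
$doc = new DOMDocument();
$doc->loadHTML($html);
$paragraph = $doc->getElementsByTagName('p')->item(0);
$text = $paragraph->textContent;

echo $text, PHP_EOL;
```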

I think I figured it out. Here is the code:
<?php
$input = "<p>text</p>";
$tem = $input;
$tem = htmlspecialchars($input);
$tem = preg_replace('/&lt;p&gt;(.+?)&lt;\/p&gt;/','<p>$1</p>',$tem);
echo $tem;
?>

You don't need to capture the content between the p tags; you only need to replace the tags themselves:
$html = preg_replace('~&lt;(/?p)&gt;~', '<$1>', $html);
However, you don't need regex at all:
$trans = array('&lt;p&gt;' => '<p>', '&lt;/p&gt;' => '</p>');
$html = strtr($html, $trans);

At least part of the trouble you're having is probably due to the fact that you seem to be playing fast and loose with the semicolons in your HTML entities. They always start with an ampersand and end with a semicolon. So it's &gt;, not &gt as you have scattered through your post.
That said, why not use html_entity_decode(), which doesn't require abusing regular expressions?
$string = 'shoop &lt;p&gt;da&lt;/p&gt; woop';
echo html_entity_decode($string);
// output: shoop <p>da</p> woop

Remove element via PHP str_replace and regex

I think my regex is off (I'm not very good at regex yet). What I'm trying to do is remove the first and last <section> tags (though as written this would replace all of them, if it worked). I set it up like this so it would completely remove the tag along with any attributes, as well as the closing tag.
The code:
//Remove from string
$content = "<section><p>Test</p></section>";
$section = "<(.*?)section(.*?)>";
$output= str_replace($section, "", $content);
echo $output;

You are looking for strip_tags. Its second argument lists the tags to keep, so pass the tags you want to survive and leave <section> out:
print strip_tags($content, '<p>');
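If you would rather stay with a pattern: str_replace() only does literal string replacement, so a regex needs preg_replace() and delimiters. A minimal sketch (the class attribute is added to the sample only to show that attributes are removed too):

```php
<?php
// Sample input; the class attribute is an assumption for the demo.
$content = '<section class="intro"><p>Test</p></section>';

// Matches <section ...> and </section>, with or without attributes.
$output = preg_replace('~</?section\b[^>]*>~i', '', $content);

echo $output; // <p>Test</p>
```

This removes every section tag; preg_replace() takes a fourth $limit argument if only the first match should go.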

Unable to use regex to search in PHP?

I'm trying to get the code of an HTML document inside specific tags.
My method works for some tags, but not all, and it does not work for the tag whose content I want to get.
Here is my code:
<html>
<head></head>
<body>
<?php
$url = "http://sf.backpage.com/MusicInstruction/";
$data = file_get_contents($url);
$pattern = "/<div class=\"cat\">(.*)<\/div>/";
preg_match_all($pattern, $data, $adsLinks, PREG_SET_ORDER);
var_dump($adsLinks);
foreach ($adsLinks as $i) {
echo "<div class='ads'>".$i[0]."</div>";
}
?>
</body>
</html>
The above code doesn't work, but it works when I change the $pattern into:
$pattern = "/<div class=\"date\">(.*)<\/div>/";
or
$pattern = "/<div class=\"sponsorBoxPlusImages\">(.*)<\/div>/";
I can't see any difference between these patterns. Please help me find the error.
Thanks.

Use PHP DOM to parse HTML instead of regex.
For example in your case (code updated to show HTML):
$doc = new DOMDocument();
// @ suppresses the warnings libxml emits for malformed real-world markup
@$doc->loadHTML(file_get_contents("http://sf.backpage.com/MusicInstruction/"));
$nodes = $doc->getElementsByTagName('div');
for ($i = 0; $i < $nodes->length; $i++)
{
    $x = $nodes->item($i);
    // no semicolon after the condition, otherwise the echo runs for every div
    if ($x->getAttribute('class') == 'cat') {
        echo htmlspecialchars($x->nodeValue) . "<hr/>"; // this is the element that you want
    }
}

The reason your regex fails is that you are expecting . to match newlines, and it won't unless you use the s modifier, so try
$pattern = "/<div class=\"cat\">(.*)<\/div>/s";
When you do this, you might find the pattern a little too greedy, as it will try to capture everything up to the last closing div element. To make it non-greedy, so that it matches only up to the very next closing div, add a ? after the *
$pattern = "/<div class=\"cat\">(.*?)<\/div>/s";
This just serves to illustrate that for all but the simplest cases, parsing HTML with regexes is the road to madness. So try using DOM functions for parsing HTML.
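To see the effect of the s modifier and the lazy quantifier side by side, here is a small sketch against a made-up two-block sample (the markup and variable names are assumptions, not the real page):

```php
<?php
// Two "cat" blocks whose content spans several lines.
$data = "<div class=\"cat\">\n<a href=\"/one\">One</a>\n</div>\n"
      . "<div class=\"cat\">\n<a href=\"/two\">Two</a>\n</div>";

// Without /s, "." stops at the newline right after the opening tag.
$plainCount  = preg_match_all('/<div class="cat">(.*)<\/div>/', $data, $plain);

// With /s but greedy, one match runs from the first opening tag to the last </div>.
$greedyCount = preg_match_all('/<div class="cat">(.*)<\/div>/s', $data, $greedy);

// With /s and a lazy quantifier, each block is matched on its own.
$lazyCount   = preg_match_all('/<div class="cat">(.*?)<\/div>/s', $data, $lazy);

printf("plain: %d, greedy: %d, lazy: %d\n", $plainCount, $greedyCount, $lazyCount);
```

Even the lazy version breaks as soon as a block contains a nested div, which is one more reason to prefer the DOM.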

Strip tags but not those inside <code>

I have seen some solutions, or at least tries, but none of them really work.
How do I strip all tags except those inside <code> or [code], and replace all the < and > inside with &lt; etc., in order to let JavaScript do some syntax highlighting on the output?

Why don't you try using strpos() to get the positions of [code] and [/code]?
Once you have the positions (assuming you only have one set of code tags), take everything before and everything after, and run strip_tags on that text.
Hope this helps.
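A minimal sketch of that strpos() idea, assuming a single [code]...[/code] pair and PHP 7 or later for the type declarations. The function name stripTagsOutsideCode is hypothetical, and escaping the code part with htmlspecialchars() is an addition to cover the "replace < and >" half of the question:

```php
<?php
// Hypothetical helper: strips tags outside one [code]...[/code] block
// and escapes the block itself so it can be shown as text.
function stripTagsOutsideCode(string $text): string
{
    $open  = '[code]';
    $close = '[/code]';

    $start = strpos($text, $open);
    $end   = ($start === false) ? false : strpos($text, $close, $start);

    // No complete block found: strip everything.
    if ($start === false || $end === false) {
        return strip_tags($text);
    }

    $end += strlen($close);
    $before = substr($text, 0, $start);
    $code   = substr($text, $start, $end - $start);
    $after  = substr($text, $end);

    return strip_tags($before) . htmlspecialchars($code) . strip_tags($after);
}

$result = stripTagsOutsideCode('<p>intro</p>[code]<b>x</b>[/code]<i>outro</i>');
echo $result, PHP_EOL;
```

For several code blocks a callback-based regex scales better than repeating this by hand.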

Use a callback:
$code = 'code: <p>[code]<hi>sss</hi>[/code]</p> more code: <p>[code]<b>sadf</b>[/code]</p>';
function codeFormat($matches)
{
return htmlspecialchars($matches[0]);
}
echo preg_replace_callback('#\[code\](?:(?!\[/code\]).)*\[/code\]#', 'codeFormat', $code);

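<?php
$str = '<b><code><b><a></a></b></code></b><code>asdsadas</code>';
$str = str_replace('[code]', '<code>', $str);
$str = str_replace('[/code]', '</code>', $str);
// Collect the contents of every code block (capture group 1), across newlines.
preg_match_all('/<code>(.*?)<\/code>/s', $str, $matches);
// Empty the blocks so strip_tags() cannot touch what was inside them.
$str = preg_replace('/<code>.*?<\/code>/s', '<code></code>', $str);
$str = strip_tags($str, '<code>');
// Refill the empty blocks in order with the escaped contents.
$i = 0;
$str = preg_replace_callback('/<code><\/code>/', function () use ($matches, &$i) {
    return '<code>'.htmlspecialchars($matches[1][$i++]).'</code>';
}, $str);
echo $str;
?>
This searches for the code tags and captures what is within them, empties the code blocks, and strips all other tags. It then walks through the matches, putting the captured text back into the code tags with < and > escaped.
EDIT: the two str_replace lines were added to allow [code] too.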

$str = '[code]
<script type="text/javascript" charset="utf-8">
var foo = "bar";
</script>
[/code]
strip me';
echo formatForDisplay( $str );
function formatForDisplay( $output ){
$output = preg_replace_callback( '#\[code]((?:[^[]|\[(?!/?code])|(?R))+)\[/code]#', 'replaceWithValues', $output );
return strip_tags($output);
}
function replaceWithValues( $matches ){
return htmlentities( $matches[ 1 ] );
}
Try this; it should work. I tested it and it seemed to have the desired effect.

Well, I tried a lot with all the code you gave me. Right now I am working with the one below, but it is still not giving the expected results.
What I want is a regular textarea where one can enter regular text and hit enter to get a new line, with no tags allowed, except maybe <strong> or <b>.
Ideally it would also recognize links and wrap them in <a> tags.
This text should automatically get <p> and <br /> where needed.
To enter code in various languages, one should type
[code lang=xxx] code [/code]. In the best case, [code lang="xxx"] or <code lang=xxx> would work too.
Then type the code, or copy and paste it inside.
The code I am using at the moment, which at least converts the tags and outputs them all right, except for tabs and line breaks, is:
public function formatForDisplay( $output ){
$output = preg_replace_callback( '#\[code lang=(php|js|css|html)]((?:[^[]|\[(?!/?code])|(?R))+)\[/code]#', array($this,'replaceWithValues'), $output );
return strip_tags($output,'<code>');
}
public function replaceWithValues( $matches ){
return '<code class="'.$matches[ 1 ].'">'.htmlentities( $matches[ 2 ] ).'</code>';
}
Similar to how it works here.

The strip_tags signature lets you specify the allowable tags:
string strip_tags ( string $str [, string $allowable_tags ] ) (from the PHP manual).
This should give you a start in the right direction, I hope.
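For example, a quick sketch of the allowable-tags argument (the sample string is made up):

```php
<?php
// Single quotes keep $x literal in the sample.
$input = '<p>Hello <b>world</b>, see <code>$x</code></p>';

// Everything except <code> is stripped.
$clean = strip_tags($input, '<code>');

echo $clean, PHP_EOL; // Hello world, see <code>$x</code>
```

Note that tags nested inside <code> are stripped as well, so code content still has to be escaped before strip_tags() runs.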
