PHP - Inner HTML recursive replace - php

I need to perform a recursive str_replace on a portion of HTML (by recursive I mean inner nodes first), so I wrote:
$str = //get HTML;
$pttOpen = '(\w+) *([^<]{1,100}?)';
$pttClose = '\w+';
$pttHtml = '(?:(?!(?:<x-)).+)';
while (preg_match("%<x-(?:$pttOpen)>($pttHtml)*</x-($pttClose)>%m", $str, $match)) {
list($outerHtml, $open, $attributes, $innerHtml, $close) = $match;
$newHtml = //some work....
str_replace($outerHtml, $newHtml, $str);
}
The idea is to first replace non-nested x-tags.
But it only works if innerHtml is on the same line as the opening tag (so I guess I misunderstood what the /m modifier does). I don't want to use a DOM library, because I just need simple string replacement. Any help?

Try this regex:
%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*)>(?P<innerHtml>.*)</x-(?P=open)>%s
Demo
http://regex101.com/r/nA2zO5
Sample code
$str = // get HTML
$pattern = '%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*)>(?P<innerHtml>.*)</x-(?P=open)>%s';
while (preg_match($pattern, $str, $matches)) {
$newHtml = sprintf('<ns:%1$s>%2$s</ns:%1$s>', $matches['open'], $matches['innerHtml']);
$str = str_replace($matches[0], $newHtml, $str);
}
echo htmlspecialchars($str);
Output
Initially, $str contained this text:
<x-foo>
sdfgsdfgsd
<x-bar>
sdfgsdfg
</x-bar>
<x-baz attr1='5'>
sdfgsdfg
</x-baz>
sdfgsdfgs
</x-foo>
It ends up with:
<ns:foo>
sdfgsdfgsd
<ns:bar>
sdfgsdfg
</ns:bar>
<ns:baz>
sdfgsdfg
</ns:baz>
sdfgsdfgs
</ns:foo>
Since I didn't know what work is done to produce $newHtml, I mimicked it by replacing x- with ns: and removing any attributes.
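The question mentions a misunderstanding of the /m modifier. As a minimal sketch (the sample string and variable names are assumptions, not from the original post): /m only changes where ^ and $ match, while /s is the modifier that lets . match newlines, which is why the pattern above uses it.

```php
<?php
// Assumed sample input: an x-tag whose content spans several lines.
$html = "<x-foo>\nline one\nline two\n</x-foo>";

// /m (multiline) only affects the anchors ^ and $; the dot still stops at "\n".
$withM = preg_match('%<x-foo>(.*)</x-foo>%m', $html);

// /s (dotall) lets the dot match newlines, so the whole element is matched.
$withS = preg_match('%<x-foo>(.*)</x-foo>%s', $html);

// What /m is actually for: ^ now matches at the start of every line.
$lineStarts = preg_match_all('%^line%m', $html);

var_dump($withM, $withS, $lineStarts);
```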

Thanks to @Alex I came up with this:
%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*?)>(?P<innerHtml>((?!<x-).)*)</x-(?P=open)>%is
Without the ((?!<x-).)* in the innerHtml pattern it won't work with nested tags (it would match the outer ones first, which isn't what I wanted). This way the innermost ones are matched first. Hope this helps.
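A small sketch of the innermost-first behaviour, using the pattern above. The sample markup and the bracketed replacement format are assumptions made up for illustration. Note that the replacement must not contain "<x-" itself, or the loop would never end.

```php
<?php
$pattern = '%<x-(?P<open>\w+)\s*(?P<attributes>[^>]*?)>(?P<innerHtml>((?!<x-).)*)</x-(?P=open)>%is';

// Assumed sample input with one level of nesting.
$str = '<x-foo>a<x-bar>b</x-bar>c</x-foo>';

$order = []; // records which tag was replaced on each pass
while (preg_match($pattern, $str, $m)) {
    $order[] = $m['open'];
    // Hypothetical replacement: wrap the content in brackets.
    $newHtml = '[' . $m['open'] . ':' . $m['innerHtml'] . ']';
    $str = str_replace($m[0], $newHtml, $str);
}

print_r($order);
echo $str, "\n";
```

The inner x-bar element is replaced on the first pass, and only then can x-foo match, because its content no longer contains "<x-".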

I don't know exactly what kind of changes you are trying to make, but this is how I would proceed:
$pattern = <<<'EOD'
~
<x-(?<tagName>\w++) (?<attributes>[^>]*+) >
(?<content>(?>[^<]++|<(?!/?x-))*) #by far more efficient than (?:(?!</?x-).)*
</x-\k<tagName>> # \k<tagName> is a backreference; \g<tagName> would re-run the \w++ subpattern
~x
EOD;
function callback($m) { // example function
return '<n-' . $m['tagName'] . $m['attributes'] . '>' . $m['content']
. '</n-' . $m['tagName'] . '>';
};
do {
$code = preg_replace_callback($pattern, 'callback', $code, -1, $count);
} while ($count);
echo htmlspecialchars(print_r($code, true));

Related

How to get the string after a certain HTML DOM element

Here is the html:
<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade#email.com<br>
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>
The output I want is:
+88 01756567676
frank.wade#email.com
NAC739
I used simple_html_dom to parse the data.
Here's the code I wrote. It works if the contact info part is wrapped in a paragraph tag.
$contact = $facultyData->find('strong[plaintext^=Phone]');
$contact = $contact[0]->parent();
$element = explode("\n", strip_tags($contact->plaintext));
$regex = '/Phone:(.*)/';
if (preg_match($regex, $element[0], $match))
$phone = $match[1];
$regex = '/Email:(.*)/';
if (preg_match($regex, $element[1], $match))
$email = $match[1];
$regex = '/Office:(.*)/';
if (preg_match($regex, $element[2], $match))
$office = $match[1];
Is there any way to get those 3 lines by matching on the tags?
Maybe you could use an XPath function, like:
$xml = new SimpleXMLElement($DomAsString);
$theText = $xml->xpath('//strong[. ="Phone"]/following-sibling::text()');
You will need some trimming to remove the ': ', and of course the DOM structure has to be fixed first, since SimpleXMLElement only accepts well-formed XML.
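As a hedged alternative sketch of the same XPath idea, DOMDocument tolerates HTML that is not well-formed XML, and DOMXPath can return text nodes directly. The sample below is an assumption: it is a shortened version of the question's markup wrapped in a <div>, and the variable names are made up for illustration.

```php
<?php
// Assumed, simplified sample based on the markup in the question.
$html = '<div><strong>Phone</strong>: +88 01756567676<br>'
      . '<strong>Email</strong>: frank.wade#email.com<br>'
      . '<strong>Office</strong>: NAC739<br></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$info = [];
foreach (['Phone', 'Email', 'Office'] as $label) {
    // First text node that follows the <strong> whose text equals the label.
    $nodes = $xpath->query('//strong[. = "' . $label . '"]/following-sibling::text()[1]');
    if ($nodes->length > 0) {
        // Trim the leading ": " and any surrounding whitespace.
        $info[$label] = trim($nodes->item(0)->nodeValue, ": \t\r\n");
    }
}

print_r($info);
```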
Or just use straight regex:
preg_match('|Phone</strong>: ([^<]+)|', $str, $m) or die('no phone'); // the capture group fills $m[1]
$phone = $m[1];
You really don't need to parse this as HTML or deal with DOM tree. You can explode your HTML string into pieces, then remove what is extra in each piece to get what you want:
<?php
$str = <<<str
<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade#email.com<br>
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>
str;
// We explode $str and use '</strong>' as delimiter and get only the part of result that we need
$lines = array_slice(explode('</strong>', $str), 3, 3);
// Define a function to remove extra text from left and right of our so called lines
function stripLine($line) {
// ltrim ' :' characters and remove everything after (and including) '<br>'
return preg_replace('/<br>.*/is', '', ltrim($line, ' :'));
}
$lines = array_map('stripLine', $lines);
print_r($lines);
See code output here.

PHP Find (and replace) string between two different strings

I have a string that looks like this: "<html>". What I want to do is get all the text between the "<" and the ">", and this should apply to any text, so that if I did "<hello>" or "<p>" that would also work. Then I want to replace this string with a string that contains the string between the tags.
For example
In:
<[STRING]>
Out:
<this is [STRING]>
Where [STRING] is the string between the tags.
Use a capture group to match everything after < that isn't >, and substitute that into the replacement string.
preg_replace('/<([^>]*)>/', '<this is $1>', $string);
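A short runnable sketch of that capture-group replacement. The input string and variable names are assumptions for illustration. Note that preg_replace returns the new string rather than modifying $string in place.

```php
<?php
// Assumed sample input containing two bracketed strings.
$string = 'Tags: <html> and <p>';

// ([^>]*) captures everything between "<" and ">"; $1 reinserts it.
$result = preg_replace('/<([^>]*)>/', '<this is $1>', $string);

echo $result, "\n";
```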
Here is a solution that tests whether the pattern exists, captures it, and then modifies it:
<?php
$str = '<[STRING]>';
$pattern = '#<(\[.*\])>#';
if(preg_match($pattern, $str, $matches)):
var_dump($matches);
$str = preg_replace($pattern, '<this is '.$matches[1].'>', $str);
endif;
echo $str;
?>
You can test here: http://ideone.com/uVqV0u
I don't know if this will be useful to you.
A regular expression is the best way, but you could also consider a small function that removes the first < and the last > character from your string.
This is my solution:
<?php
/*Vars to test*/
$var1="<HTML>";
$var2="<P>";
$var3="<ALL YOU WANT>";
/*function*/
function replace($string_tag) {
$newString="";
for ($i=1; $i<(strlen($string_tag)-1); $i++){
$newString.=$string_tag[$i];
}
return $newString;
}
/*Output*/
echo (replace($var1));
echo "\r\n";
echo (replace($var2));
echo "\r\n";
echo (replace($var3));
?>
The output is:
HTML
P
ALL YOU WANT
Tested on https://ideone.com/2RnbnY

Add id attribute to hyperlinks through PHP Regular Expressions

I am still relatively new to regular expressions and feel my code is being too greedy. I am trying to add an id attribute to existing links in a piece of code. My function looks like this:
function addClassHref($str) {
//$str = stripslashes($str);
$preg = "/<[\s]*a[\s]*href=[\s]*[\"\']?([\w.-]*)[\"\']?[^>]*>(.*?)<\/a>/i";
preg_match_all($preg, $str, $match);
foreach ($match[1] as $key => $val) {
$pattern[] = '/' . preg_quote($match[0][$key], '/') . '/';
$replace[] = "<a id='buttonRed' href='$val'>{$match[2][$key]}</a>";
}
return preg_replace($pattern, $replace, $str);
}
This adds the id attribute like I want, but it breaks the hyperlink. For example:
If the original code is: <a href="http://www.google.com">Link</a>
Instead of <a id="class" href="http://www.google.com">Link</a>
It is giving
<a id="class" href="http">Link</a>
Any suggestions or thoughts?
Do not use regular expressions to parse XML or HTML.
$doc = new DOMDocument();
$doc->loadHTML($html);
$all_a = $doc->getElementsByTagName('a');
$firsta = $all_a->item(0);
$firsta->setAttribute('id', 'idvalue');
echo $doc->saveHTML($firsta);
You've got some overcomplications in your regex :)
Also, there's no need for the loop as preg_replace() will hit all the instances of the search pattern in the relevant string. The first regex below will take everything in the a tag and simply add the id attribute on at the end.
$str = 'Link' . "\n" .
'Link' . "\n" .
'Link';
$p = "{<\s*a\s*(href=[^>]*)>([^<]*)</a>}i";
$r = "<a $1 id=\"class\">$2</a>";
echo preg_replace($p, $r, $str);
If you only want to capture the href attribute you could do the following:
$p = '{<\s*a\s*href=["\']([^"\']*)["\'][^>]*>([^<]*)</a>}i';
$r = "<a href='$1' id='class'>$2</a>";
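A runnable sketch of both patterns above. The sample link is an assumption, reconstructed from the URL mentioned in the question, and the variable names are made up for illustration.

```php
<?php
// Assumed sample input based on the URL in the question.
$str = '<a href="http://www.google.com">Link</a>';

// Variant 1: keep everything inside the opening tag and append the id.
$keepAll = preg_replace(
    '{<\s*a\s*(href=[^>]*)>([^<]*)</a>}i',
    '<a $1 id="class">$2</a>',
    $str
);

// Variant 2: capture only the href value and rebuild the tag.
$hrefOnly = preg_replace(
    '{<\s*a\s*href=["\']([^"\']*)["\'][^>]*>([^<]*)</a>}i',
    '<a href=\'$1\' id=\'class\'>$2</a>',
    $str
);

echo $keepAll, "\n", $hrefOnly, "\n";
```

Because [^"\']* accepts any character except a quote, the colon and slashes in the URL are kept intact.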
Your first subpattern ([\w.-]*) doesn't match :, thus it stops at "http".
Couldn't you just use a simple str_replace() for this? Regex seems like overkill if this is all you're doing.
$str = str_replace('<a ', '<a id="someID" ', $str);

Removal of bad hyperlinks and the content inside of them

Ok, basically I have an array of bad urls and I would like to search through a string and strip them out. I want to strip everything from the opening tag to the closing tag, but only if the url in the hyperlink is in the array of bad urls. Here is how I would picture it working but I don't understand regular expressions well.
foreach($bad_urls as $bad_url){
$pattern = "/<a*$bad_url*</a>/";
$replacement = ' ';
preg_replace($pattern, $replacement, $content);
}
Thanks in advance.
Assuming that your 'bad urls' are properly formatted URLs, I would suggest doing something like this:
foreach($bad_urls as $bad_url){
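// [^>]* keeps the match inside one opening tag; the i flag covers A/HREF, and .*? stops at the first </a>
$pattern = '/<a\s[^>]*href="' . convert_to_pattern($bad_url) . '"[^>]*>.*?<\/a>/is';
$replacement = ' ';
$content = preg_replace($pattern, $replacement, $content); // PHP has no preg_replace_all()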
}
and separately
function convert_to_pattern($url)
{
// preg_quote() escapes every regex metacharacter, plus the '/' delimiter given as the second argument
return preg_quote($url, '/');
}
Please do not try to parse HTML using regular expressions. Just load up the HTML in a DOM, find all the <a> tags and check the href property. Much simpler and fool-proof.
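A minimal sketch of that DOM approach, assuming the content is an HTML fragment and the bad URLs must match the href exactly. The function name removeBadLinks, the wrapping <div>, and the example URLs are assumptions for illustration.

```php
<?php
function removeBadLinks(string $html, array $badUrls): string
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    // Wrap the fragment so its top-level nodes can be found again afterwards.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    // Copy the live node list to an array before removing nodes from it.
    foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
        if (in_array($a->getAttribute('href'), $badUrls, true)) {
            $a->parentNode->removeChild($a); // drops the link and its content
        }
    }

    // Serialize only the children of the wrapper, not the implied <html>/<body>.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$content = 'Keep <a href="http://good.example/">this</a> but drop <a href="http://bad.example/">that</a>.';
$clean = removeBadLinks($content, ['http://bad.example/']);
echo $clean, "\n";
```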

Strip tags but not those inside <code>

I have seen some solutions, or at least tries, but none of them really work.
How do I strip all tags except those inside <code> or [code], and replace all the < and > there with &lt; etc., in order to let JavaScript do some syntax highlighting on the output?
Why don't you try using strpos() to get the positions of [code] and [/code]?
Once you have the locations (assuming you only have one set of code tags), just take everything before and everything after them and run strip_tags on that text.
Hope this helps.
Use a callback:
$code = 'code: <p>[code]<hi>sss</hi>[/code]</p> more code: <p>[code]<b>sadf</b>[/code]</p>';
function codeFormat($matches)
{
return htmlspecialchars($matches[0]);
}
echo preg_replace_callback('#\[code\](?:(?!\[/code\]).)*\[/code\]#', 'codeFormat', $code);
<?php
$str = '<b><code><b><a></a></b></code></b><code>asdsadas</code>';
$str = str_replace('[code]', '<code>', $str);
$str = str_replace('[/code]', '</code>', $str);
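// Capture the contents of every <code> block, not just the first one
preg_match_all('/<code>(.*?)<\/code>/s', $str, $matches);
$str = strip_tags($str, "<code>");
// Split on the stripped code blocks, then re-insert the escaped originals in order
$parts = preg_split('/<code>.*?<\/code>/s', $str);
$str = array_shift($parts);
foreach($matches[1] as $i => $match)
{
$str .= '<code>'.htmlspecialchars($match).'</code>'.($parts[$i] ?? '');
}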
echo $str;
?>
This searches for the code tags and captures what is within the tags. Strips the tags. Loops through the matches replacing the code tags with the text captured and replacing the < and >.
EDIT: the two str_replace lines added to allow [code] too.
$str = '[code]
<script type="text/javascript" charset="utf-8">
var foo = "bar";
</script>
[/code]
strip me';
echo formatForDisplay( $str );
function formatForDisplay( $output ){
$output = preg_replace_callback( '#\[code]((?:[^[]|\[(?!/?code])|(?R))+)\[/code]#', 'replaceWithValues', $output );
return strip_tags($output);
}
function replaceWithValues( $matches ){
return htmlentities( $matches[ 1 ] );
}
Try this; I tested it and it seemed to have the desired effect.
Well, I tried a lot with all the code you gave. Right now I am working with this one, but it is still not giving the expected results.
What I want is a regular textarea where one can enter plain text and hit Enter to get a new line. Tags are not allowed here, except maybe <strong> or <b>.
Ideally, links would be recognized and wrapped in <a> tags.
This text should automatically get <p> and <br /> where needed.
To fill in code in various languages one should type
[code lang=xxx] code [/code] - in the best case [code lang="xxx"] or <code lang=xxx> would work too.
Then type the code, or copy and paste it inside.
The code I am using at the moment, which at least converts the tags and outputs them correctly except for tabs and line breaks, is:
public function formatForDisplay( $output ){
$output = preg_replace_callback( '#\[code lang=(php|js|css|html)]((?:[^[]|\[(?!/?code])|(?R))+)\[/code]#', array($this,'replaceWithValues'), $output );
return strip_tags($output,'<code>');
}
public function replaceWithValues( $matches ){
return '<code class="'.$matches[ 1 ].'">'.htmlentities( $matches[ 2 ] ).'</code>';
}
Similar to how it works here.
The strip_tags syntax gives you an option to specify the allowable tags:
string strip_tags ( string $str [, string $allowable_tags ] ) -> from PHP manual.
This should give you a start in the right direction, I hope.
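One way to combine the ideas in this thread, as a sketch only: split the text on [code]...[/code] blocks, escape what is inside them, and run strip_tags on everything else. The function name formatPost and the sample input are assumptions, and the lang attribute and automatic <p>/<br /> handling from the follow-up are not covered here.

```php
<?php
function formatPost(string $text): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even indexes are text outside code blocks, odd indexes are the blocks.
    $chunks = preg_split('#(\[code\].*?\[/code\])#s', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($chunks as $i => $chunk) {
        if ($i % 2 === 1) {
            // Drop the 6-character "[code]" and the 7-character "[/code]".
            $inner = substr($chunk, 6, -7);
            $out .= '<code>' . htmlspecialchars($inner) . '</code>';
        } else {
            $out .= strip_tags($chunk);
        }
    }
    return $out;
}

$formatted = formatPost('<p>hi</p>[code]<b>x</b>[/code]<i>bye</i>');
echo $formatted, "\n";
```

Tags outside the code block are removed, while tags inside it survive as escaped text that a JavaScript highlighter can work on.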
