PHP DomDocument to replace pattern - php

I need to find plain http links and turn them into hyperlinks. These http links are inside span tags.
$text holds an HTML page. One of the span tags contains something like
<span class="styleonetwo" >http://www.cnn.com/live-event</span>
Here is my code:
$doc = new DOMDocument();
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('span') as $anchor) {
$link = $anchor->nodeValue;
if(substr($link, 0, 4) == "http")
{
$link = "<a href='$link'>$link</a>";
}
if(substr($link, 0, 3) == "www")
{
$link = "<a href='$link'>$link</a>";
}
$anchor->nodeValue = $link;
}
echo $doc->saveHTML();
It works OK. However, I want this to work even if the data inside the span is something like:
<span class="styleonetwo" > sometexthere http://www.cnn.com/live-event somemoretexthere</span>
Obviously the above code won't work in this situation. Is there a way to search and replace a pattern using DOMDocument without using preg_replace?
Update: To answer phil's question regarding preg_replace:
I used regexpal.com to test the following pattern matching:
\b(?:(?:https?|ftp|file)://|(www|ftp)\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]
It works great in the regextester provided in regexpal. When I use the same pattern in PHP code, I got tons of weird errors. I got unknown modifier error even for escape character! Following is my code for preg_replace
$httpRegex = '/\b(\?:(\?:https?|ftp|file):\/\/|(www|ftp)\.)[-A-Z0-9+&@#/%\?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]/';
$cleanText = preg_replace($httpRegex, "<a href='$0'>$0</a>", $text);
I was so frustrated with "unknown modifiers" and pursued DOMDocument to solve my problem.

Regular expressions suit this problem well, so you are better off using preg_replace.
Your pattern just has several unescaped delimiters in it, so escape them or choose another character as the delimiter, for instance ^. Thus, the correct pattern would be:
$httpRegex = '^\b(?:(?:https?|ftp|file):\/\/|(www|ftp)\.)[-A-Z0-9+&@#\/%\?=~_|$!:,.;]*[-A-Z0-9+&@#\/%=~_|$]^i';
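A minimal sketch of the corrected pattern in use. It assumes the usual / delimiter with every literal slash escaped, rather than the ^ delimiter suggested above. The sample $text and the double-quoted href in the replacement are illustrative choices, not taken from the question.

```php
<?php
// Delimiter is "/", so every literal "/" inside the pattern is written as "\/".
// The trailing "i" makes the A-Z ranges match lowercase letters too.
$httpRegex = '/\b(?:(?:https?|ftp|file):\/\/|(?:www|ftp)\.)[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[-A-Z0-9+&@#\/%=~_|$]/i';

// Sample input (assumed): a span with text around the URL.
$text = '<span class="styleonetwo"> sometexthere http://www.cnn.com/live-event somemoretexthere</span>';

// $0 is the whole match; single quotes keep PHP from touching it.
$cleanText = preg_replace($httpRegex, '<a href="$0">$0</a>', $text);

echo $cleanText, PHP_EOL;
```

Run against a whole page, a pattern like this would also rewrite URLs that already sit inside href or src attributes, so it is probably safer to apply it only to the text of the spans you care about.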

Related

How can I exclude regex href matches of a particular domain?

How can I exclude href matches for a domain (ex. one.com)?
My current code:
$str = 'This string has <a href="http://one.com">one link</a> and <a href="http://two.com">another link</a>';
$str = preg_replace('~<a href="(https?://[^"]+)".*?>.*?</a>~', '$1', $str);
echo $str; // This string has http://one.com and http://two.com
Desired result:
This string has <a href="http://one.com">one link</a> and http://two.com
Using a regular expression
If you're going to use a regular expression to accomplish this task, you can use a negative lookahead. It basically asserts that the part // in the href attribute is not followed by one.com. It's important to note that a lookaround assertion doesn't consume any characters.
Here's what the regular expression would look like:
<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
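As a quick check of the lookahead, here is a small sketch. The sample string is an assumption, reconstructed from the output shown in the question.

```php
<?php
// Sample input (assumed): one link to one.com and one link to two.com.
$str = 'This string has <a href="http://one.com">one link</a> '
     . 'and <a href="http://two.com">another link</a>';

// (?!one\.com) makes the match fail whenever "one.com" directly follows "://",
// so that link is left untouched while the other is replaced by its URL.
$result = preg_replace('~<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>~', '$1', $str);

echo $result, PHP_EOL;
```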
Using a DOM parser
Even though this is a pretty simple task, the correct way to achieve it is with a DOM parser. That way, you won't have to change the regex if the format of your markup changes in the future. The regex solution will break if the <a> node contains more attributes. To avoid those issues, you can use a DOM parser such as PHP's DOMDocument to handle the parsing.
Here's what the solution would look like:
$dom = new DOMDocument();
$dom->loadHTML($html); // $html is the string containing markup
$links = $dom->getElementsByTagName('a');
//Loop through links and replace them with their href
for ($i = $links->length - 1; $i >= 0; $i--) {
$node = $links->item($i);
$href = $node->getAttribute('href');
if ($href !== 'http://one.com') {
$newTextNode = $dom->createTextNode($href);
$node->parentNode->replaceChild($newTextNode, $node);
}
}
echo $dom->saveHTML();
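The strict comparison against 'http://one.com' would miss variants such as a trailing slash, a path, or a www. prefix. One possible refinement, sketched here on the assumption that matching on the host is what is wanted, uses parse_url(). The helper isOnDomain() and the sample markup are hypothetical; other links are replaced with their URL, as in the question's desired output.

```php
<?php
// Hypothetical helper: true when $href points at $domain or one of its subdomains.
function isOnDomain(string $href, string $domain): bool
{
    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // relative or malformed URL: no host to compare
    }
    $host = strtolower($host);
    $domain = strtolower($domain);
    if ($host === $domain) {
        return true;
    }
    $suffix = '.' . $domain;
    return substr($host, -strlen($suffix)) === $suffix;
}

// Sample markup (assumed).
$html = '<p>This string has <a href="http://one.com/">one link</a> '
      . 'and <a href="http://two.com/page">another link</a></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');

// The node list is live, so walk it backwards while replacing nodes.
for ($i = $links->length - 1; $i >= 0; $i--) {
    $node = $links->item($i);
    $href = $node->getAttribute('href');
    if (!isOnDomain($href, 'one.com')) {
        $node->parentNode->replaceChild($dom->createTextNode($href), $node);
    }
}

// Serialize only the paragraph to skip the html/body wrapper.
$result = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));
echo $result, PHP_EOL;
```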
This should do it:
<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
We use a negative lookahead to make sure that one.com does not appear directly after the https?://.
If you also want to check for some subdomains of one.com, use this example:
<a href="(https?://(?!((www|example)\.)?one\.com)[^"]+)".*?>.*?</a>
Here we optionally check for www. or example. before one.com. This will still allow a URL like misc.one.com, though. If you want to exclude all subdomains of one.com, use this:
<a href="(https?://(?!([^.]+\.)?one\.com)[^"]+)".*?>.*?</a>
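The two subdomain variants can be compared side by side. This is only a sketch: the helper unlinkExcept() and the four sample links are made up for illustration.

```php
<?php
// Hypothetical helper: replaces every link with its URL unless the
// lookahead body matches right after "http://" or "https://".
function unlinkExcept(string $lookahead, string $html): string
{
    $pattern = '~<a href="(https?://(?!' . $lookahead . ')[^"]+)".*?>.*?</a>~';
    return preg_replace($pattern, '$1', $html);
}

// Sample markup (assumed).
$html = '<a href="http://one.com">a</a> <a href="http://www.one.com">b</a> '
      . '<a href="http://misc.one.com">c</a> <a href="http://two.com">d</a>';

// Keeps one.com plus the listed subdomains; misc.one.com is still unlinked.
$listed = unlinkExcept('((www|example)\.)?one\.com', $html);

// Keeps one.com and any single-label subdomain of it.
$anySub = unlinkExcept('([^.]+\.)?one\.com', $html);

echo $listed, PHP_EOL;
echo $anySub, PHP_EOL;
```

Note that none of these lookaheads anchor the end of the host, so a host such as one.company.com would also be treated as one.com.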

unable to understand how to match all characters except a given sequence with preg_replace() in php

So what I am trying to do is match, with a regular expression, text that has an opening &lt;p&gt; tag and a closing &lt;/p&gt; tag. This is the code I wrote:
<?php
$input = "&lt;p&gtjust some text&lt;/p&gt more text!";
$input = preg_replace('/&lt;p&gt[^(&lt;\/p&gt)]+?&lt\/;p&gt/','<p>$1</p>',$tem);
echo $input;
?>
So the code does not seem to replace &lt;p&gt with <p> or &lt;/p&gt with </p>. I think the problem is in the part where I am checking for all characters except '&lt;/p&gt'. I don't think [^(&lt;\/p&gt)] groups the characters correctly. I think it checks whether any one of the characters is absent, not whether the entire sequence is absent. Please help me out here.
[] in a regex is a character class; you cannot match strings this way, only single characters or Unicode code points.
If you have escaped HTML entities, you can use htmlspecialchars_decode() to convert them back into characters.
After you have valid HTML, you can use the DOM to parse, traverse and manipulate it.
How do you parse and process HTML/XML in PHP?
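To illustrate the point about character classes, here is a small sketch. It assumes well-formed entities (with the trailing semicolons) and a sample string modelled on the question's; a lazy group is one common way to match "everything up to a sequence", not the only one.

```php
<?php
// Sample input (assumed): an escaped paragraph followed by plain text.
$input = '&lt;p&gt;just some text&lt;/p&gt; more text!';

// [^(&lt;\/p&gt;)] is a character class: it rejects any single character from
// the set ( & l t ; / p g ), so the "t" in "just" already stops the match.
$withClass = preg_match('/&lt;p&gt;[^(&lt;\/p&gt;)]+?&lt;\/p&gt;/', $input);

// A lazy group captures everything up to the first closing sequence instead.
$withLazy = preg_replace('/&lt;p&gt;(.+?)&lt;\/p&gt;/s', '<p>$1</p>', $input);

var_dump($withClass); // 0 matches
echo $withLazy, PHP_EOL;
```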
I think I figured it out. Here is the code:
<?php
$input = "<p>text</p>";
$tem = $input;
$tem = htmlspecialchars($input);
$tem = preg_replace('/&lt;p&gt;(.+?)&lt;\/p&gt;/','<p>$1</p>',$tem);
echo $tem;
?>
You don't need to capture the content between p tags, you only need to replace p tags:
$html = preg_replace('~&lt;(/?p)&gt;~', '<$1>', $html);
However, you don't need a regex either:
$trans = array('&lt;p&gt;' => '<p>', '&lt;/p&gt;' => '</p>');
$html = strtr($html, $trans);
At least part of the trouble you're having is probably due to the fact that you seem to be playing fast and loose with the semicolons in your HTML entities. They always start with an ampersand and end with a semicolon. So it's &gt;, not &gt as you have scattered through your post.
That said, why not use html_entity_decode(), which doesn't require abusing regular expressions?
$string = 'shoop &lt;p&gt;da&lt;/p&gt; woop';
echo html_entity_decode($string);
// output: shoop <p>da</p> woop
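The suggestions in these answers differ mainly in how much they decode. A short sketch with an assumed sample string that also contains an escaped <b> tag makes the difference visible:

```php
<?php
// Sample input (assumed): escaped <p> and <b> tags.
$escaped = '&lt;p&gt;just some text&lt;/p&gt; more &lt;b&gt;text&lt;/b&gt;!';

// 1. Regex: only the escaped <p> and </p> tags are turned back into markup.
$viaRegex = preg_replace('~&lt;(/?p)&gt;~', '<$1>', $escaped);

// 2. strtr with a translation table: same result without a regex.
$viaStrtr = strtr($escaped, ['&lt;p&gt;' => '<p>', '&lt;/p&gt;' => '</p>']);

// 3. htmlspecialchars_decode: decodes every escaped tag, not just <p>.
$viaDecode = htmlspecialchars_decode($escaped);

echo $viaRegex, PHP_EOL;
echo $viaStrtr, PHP_EOL;
echo $viaDecode, PHP_EOL;
```

Which one fits depends on whether the other escaped tags should stay escaped.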

Funny behavior with html paragraph tags

$regex = '#<p.+</p>#s';
My objective is to return the large string that occurs between the first paragraph tag, and the last paragraph tag. This is to include everything, even other paragraphs.
My regex above works for everything EXCEPT the paragraph tags. I tested it replacing the 'p' with 'html' and returned success, replaced with 'script' and returned success... Why would this return true for those cases but not for the paragraph?
I am still working on this, and relatively convinced that there is no strange escape sequence that is causing the regex to stop... I think this because I can extract everything between the first and last 'html' tag. The text between the 'html' tags also contains all of the 'p' tags that I am failing to extract. If there were some kind of escape or error, I think it would also throw the same error when extracting for the 'html' tags. I have tried preg_quote() with no success.
Perhaps I need to set memory devoted to regex processing higher so that it can process the whole document?
Update: In most cases the leading 'p' will NOT belong to the same paragraph as the ending '/p' tag.
Update: The returned results will be something akin to:
<p>this is the first tag</p>this is a bunch of text from the document, could be all manner of tags <p>this is the last paragraph tag</p>
Update: Code example
$htmlArticle = <<< 'ENDOFHTML'
Insert data from pastebin here
http://pastebin.com/4A3FYGc8
ENDOFHTML;
$pattern = '#<html.+/html>#s'; // Works fine, returns all characters between first <html and last /html
$pattern = '#<script.+/script>#s'; // Works fine, same as above
$pattern = '#<p.+/p>#s'; // Returns nothing, nothing at all. :'(
preg_match($pattern, $htmlArticle, $matches);
var_dump($matches);
?>
Solution:
ini_set('pcre.backtrack_limit', '1000000');
I had exhausted my backtrack limit. This is a setting in your php.ini file, and can be set in code with ini_set(). Curiously, I set the value with ini_set() to match that in my php.ini file... So it should have worked from the start. --- Thanks coming as soon as I can post a solution.
That is very curious. It's not returning an error, and using a shorter document seems to return a match. I can't understand why this would happen. I've used regexes on enormous documents without trouble.
Note that this produces a match: #<p\b.+<\#s
Perhaps try playing with the backtrack limit, since there are many </p> matches. However if the limit were too low I would expect preg_match to return False, not 0!
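The backtrack-limit theory can be checked directly, because var_dump($matches) prints an empty array both for "no match" and for "PCRE gave up". The sketch below is an illustration under assumptions: the helper matchOrExplain() is hypothetical, the limit of 100 is artificially low, the document is synthetic, and JIT is switched off because the backtrack limit does not apply to JIT-compiled matching.

```php
<?php
// Hypothetical helper: runs preg_match and reports PCRE failures explicitly.
function matchOrExplain(string $pattern, string $subject): array
{
    $result = preg_match($pattern, $subject, $matches);
    if ($result === false) {
        // false means an error occurred; 0 would mean "no match".
        return ['ok' => false, 'error' => preg_last_error(), 'match' => null];
    }
    return ['ok' => true, 'error' => PREG_NO_ERROR, 'match' => $matches[0] ?? null];
}

ini_set('pcre.jit', '0'); // the limit is only enforced by the interpreter

// Synthetic document: the last </p> is followed by a long tail, so the greedy
// ".+" has to give back thousands of characters before "</p>" can match.
$doc = '<p>first</p>' . str_repeat('x', 10000);

ini_set('pcre.backtrack_limit', '100');
$low = matchOrExplain('#<p.+</p>#s', $doc);

ini_set('pcre.backtrack_limit', '1000000');
$high = matchOrExplain('#<p.+</p>#s', $doc);

var_dump($low['error'] === PREG_BACKTRACK_LIMIT_ERROR);
var_dump($high['match']);
```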
As a workaround, try this instead:
function extractBetweenPs($data) {
$startoffset = null;
$endoffset = null;
if (preg_match('/<p\b/', $data, $matches, PREG_OFFSET_CAPTURE)) {
$startoffset = $matches[0][1];
$needle = '</p>';
$endoffset = strrpos($data, $needle);
if ($endoffset !== FALSE) {
$endoffset += strlen($needle);
} else {
// this will return everything from '<p' to the end of the doc
// if there is no '</p>'
// maybe not what you want?
$endoffset = strlen($data);
}
return substr($data, $startoffset, $endoffset-$startoffset);
}
return '';
}
That said, this is a very strange requirement--treating an arbitrary section of a structured document as a blob. Maybe you could step back and say what your broader goal is and we can suggest another approach?
Regex is not a tool that can be used to correctly parse HTML.
All you need is DOMDocument
$dom = new DOMDocument();
$dom->loadHTML($your_html);
$node = $dom->getElementsByTagName('p')->item(0);
$dom2 = new DOMDocument();
$node = $dom2->importNode($node, true);
$dom2->appendChild($node);
echo $dom2->saveHTML();

regex to replace mailto: hrefs but ignore site links

I need some help to tweak this regular expression:
$content = 'more test test Jeff this is a test';
$content = preg_replace("~<a .*?href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>~", "$1", $content);
This expression strips the HTML markup off a mailto link and returns just the email (jeff@test.com).
It works fine except in the example I gave above. Because an unlimited amount of whitespace is allowed before the href in the pattern, when a website link comes before the mailto link, the regex looks all the way forward until it finds the mailto: in the following link and removes all the content in between.
Maybe a fix would be to limit it to two or three whitespace characters after the opening tag so as not to look so far ahead, but I wonder if there is a better solution from people who know regex better than I do?
Here is what you should be using...
$dom = new DOMDocument;
$dom->loadHTML($content);
foreach($dom->getElementsByTagName('a') as $a) {
if ($a->hasAttribute('href')
AND strpos($href = trim($a->getAttribute('href')), 'mailto:') === 0) {
$textNode = $dom->createTextNode(substr($href, 7));
$parent = $a->parentNode;
$parent->insertBefore($textNode, $a);
$parent->removeChild($a);
}
}
$dom->saveHTML() adds all the HTML boilerplate such as the html and body elements; you can remove them with...
$html = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
$html .= $dom->saveHTML($node);
}
The problem is not that you allow any amount of whitespace; that would work. The problem is that you allow one space and then any amount of ANY character with your <a .*
If you fix this and really allow only whitespace, like this
<a\s+href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>
it seems to work.
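A small sketch of the difference, using an assumed input string (the question's original markup is not shown, so the URLs here are made up). The third pattern is an extra suggestion, not part of the answer above: it tolerates other attributes before href without crossing the tag's closing >.

```php
<?php
// Sample input (assumed): a site link followed by a mailto link.
$content = 'more test <a href="http://example.com">test</a> '
         . '<a href="mailto:jeff@test.com">Jeff</a> this is a test';

// Original pattern: "<a .*?" may run past the first link into the next tag.
$loose = preg_replace('~<a .*?href=[\'"]mailto:(.*?)[\'"].*?>.*?</a>~', '$1', $content);

// Tightened pattern: only whitespace may sit between "<a" and "href".
$strict = preg_replace('~<a\s+href=[\'"]mailto:(.*?)[\'"].*?>.*?</a>~', '$1', $content);

// Alternative: allow other attributes, but never cross the tag's ">".
$tolerant = preg_replace('~<a\s[^>]*?href=[\'"]mailto:(.*?)[\'"].*?>.*?</a>~', '$1', $content);

echo $loose, PHP_EOL;    // the site link is swallowed
echo $strict, PHP_EOL;   // the site link survives
echo $tolerant, PHP_EOL; // same as $strict for this input
```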
But you should probably take a closer look at alex's answer (+1 for the example), as it is the cleaner solution.

Remove all text between <hr> and <embed> tag?

<hr>I want to remove this text.<embed src="stuffinhere.html"/>
I tried using regex but nothing works.
Thanks in advance.
P.S. I tried this: $str = preg_replace('#(<hr>).*?(<embed)#', '$1$2', $str)
You'll get a lot of advice to use an HTML parser for this kind of thing. You should do that.
The rest of this answer is for when you've decided that the HTML parser is too slow, doesn't handle ill formed (i.e. standard in the wild) HTML, or is a pain in the ass to integrate into the system you don't control. I created the following small shell script
$str = '<hr>I want to remove this text.<embed src="stuffinhere.html"/>';
$str = preg_replace('#(<hr>).*?(<embed)#', '$1$2', $str);
var_dump($str);
//outputs
string(35) "<hr><embed src="stuffinhere.html"/>"
and it did remove the text, so I'd check your source documents and any other PHP code around your RegEx. You're not feeding preg_replace the string you think you are. My best guess is your source document has irregular case, or there's whitespace between the <hr /> and <embed>. Try the following regular expression instead.
$str = '<hr>I want to remove
this text.
<EMBED src="stuffinhere.html"/>';
$str = preg_replace('#(<hr>).*?(<embed)#si', '$1$2', $str);
var_dump($str);
//outputs
string(35) "<hr><EMBED src="stuffinhere.html"/>"
The "i" modifier says "make this search case insensitive". The "s" modifier says "the [.] character should also match my platform's line break/carriage return sequence"
But use a proper parser if you can. Seriously.
I think the code is self-explanatory and pretty easy to understand since it does not use regex (and it might be faster)...
$start='<hr>';
$end='<embed src="stuff...';
$str=' html here... ';
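function between($t1,$t2,$page) {
$p1=stripos($page,$t1);
if($p1===false) {
return false;
}
$p2=stripos($page,$t2,$p1+strlen($t1));
if($p2===false) {
// end marker missing: nothing to extract
return false;
}
return substr($page,$p1+strlen($t1),$p2-$p1-strlen($t1));
}
$found=between($start,$end,$str);
// stop once nothing is left between the markers, otherwise this loops forever
while($found!==false && $found!=='') {
// case-insensitive replace, to match the case-insensitive stripos() search
$str=str_ireplace($start.$found.$end,$start.$end,$str);
$found=between($start,$end,$str);
}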
// do something with $str here...
$text = '<hr>I want to remove this text.<embed src="stuffinhere.html"/>';
$text = preg_replace('#(<hr>).*?(<embed.*?>)#', '$1$2', $text);
echo $text;
If you want to hard code src in embed tag:
$text = '<hr>I want to remove this text.<embed src="stuffinhere.html"/>';
$text = preg_replace('#(<hr>).*?(<embed src="stuffinhere.html"/>)#', '$1$2', $text);
echo $text;
