How can I exclude regex href matches of a particular domain?

How can I exclude regex href matches of a particular domain? - php

How can I exclude href matches for a domain (ex. one.com)?
My current code:
$str = 'This string has one link and another link';
$str = preg_replace('~<a href="(https?://[^"]+)".*?>.*?</a>~', '$1', $str);
echo $str; // This string has http://one.com and http://two.com
Desired result:
This string has one link and http://two.com

Using a regular expression
If you're going to use a regular expression to accomplish this task, you can use a negative lookahead. It basically asserts that the part // in the href attribute is not followed by one.com. It's important to note that a lookaround assertion doesn't consume any characters.
Here's how the regular expression would look like:
<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
Regex Visualization:
Regex101 demo
Using a DOM parser
Even though this is a pretty simple task, the correct way to achieve this would be using a DOM parser. That way, you wouldn't have to change the regex if the format of your markup changes in future. The regex solution will break if the <a> node contains more attribute values. To fix all those issues, you can use a DOM parser such as PHP's DOMDocument to handle the parsing:
Here's how the solution would look like:
$dom = new DOMDocument();
$dom->loadHTML($html); // $html is the string containing markup
$links = $dom->getElementsByTagName('a');
//Loop through links and replace them with their anchor text
for ($i = $links->length - 1; $i >= 0; $i--) {
$node = $links->item($i);
$text = $node->textContent;
$href = $node->getAttribute('href');
if ($href !== 'http://one.com') {
$newTextNode = $dom->createTextNode($text);
$node->parentNode->replaceChild($newTextNode, $node);
}
}
echo $dom->saveHTML();
Live Demo

This should do it:
<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
We use a negative lookahead to make sure that one.com does not appear directly after the https?://.
If you also want to check for some subdomains of one.com, use this example:
<a href="(https?://(?!((www|example)\.)?one\.com)[^"]+)".*?>.*?</a>
Here we optionally check for www. or example. before one.com. This will allow a URL like misc.com, though. If you want to remove all subdomains of one.com, use this:
<a href="(https?://(?!([^.]+\.)?one\.com)[^"]+)".*?>.*?</a>

Related

Regex to find anchor tag not working accurately

I have the following regex to find anchor tag that has 'Kontakt' as the anchor text:
#<a.*href="[^"]*".*>Kontakt<\/a>#
Here is the string to find from:
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</li></ul>
So the result should be:
<a href="/kontakt" >Kontakt</a>
But the result I get is:
Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt
And here is my PHP code:
$pattern = '#<a.*href="[^"]*".*>Kontakt<\/a>#';
preg_match_all($pattern, $string, $matches);

You are using preg_match_all() so I assume you are willing to receive multiple qualifying anchor tags. Parsing valid HTML with a legitimate DOM parser will always be more stable and easier to read than the equivalent regex technique. It's just not a good idea to rely on regex for DOM parsing because regex is "DOM-unaware" -- it just matches things that look like HTML entities.
In the XPath query, search for <a> tags (existing at any depth in the document) which have the qualifying string as the whole text.
Code: (Demo)
$html = <<<HTML
<li class="item-133">Wissenswertes</li><li class="item-115"><a href="/team" >Team</li><li class="item-116 menu-parent"></span></li><li class="item-350"><a href="/kontakt" >Kontakt</li></ul>
HTML;
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query('//a[text() = "Kontakt"]') as $a) {
$result[] = $dom->saveHtml($a);
}
var_export($result);
Output:
array (
0 => 'Kontakt',
)
Is it more concise to use regex? Yes, but it is also less reliable for general use.
You will notice that the DOMDocument also automatically cleans up the unnecessary spacing in your markup.

If you can trust your input will always have <a href in every anchor tag then try:
'#<a href="[^"]*"[^>]*>Kontakt<\/a>#';
// Instead of what you have:
'#<a.*href="[^"]*".*>Kontakt<\/a>/#';
.* is the "wildcard" meta-character . and the "zero or more times" quantifier * together.
.* matches anything any number of times.
Try it https://regex101.com/r/qxnRZv/1

Your regex:
...a.*href...
is greedy, which means: "after a, match as many characters as possible before a href". That causes your regex to return multiple hrefs.
You can use the lazy-mode operator ? :
...a.*?href....
which means "after a, match as few characters as possible before a href". It should work.

preg_match all paragraphs in a string

The following string contains multiple <p> tags. I want to match the contents of each of the <p> with a pattern, and if it matches, I want to add a css class to that specific paragraph.
For example in the following string, only the second paragraph content matches, so i want to add a class to that paragraph only.
$string = '<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>';
With the following code, I can match all of the string, but I am unable to figure out how to find the specific paragraph.
$rtl_chars_pattern = '/[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]/u';
$return = preg_match($rtl_chars_pattern, $string);

Create a capture group on the <p> tag
Use preg_replace
https://regex101.com/r/nE5pT1/1
$str = "<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>";
$result = preg_replace("/(<p>)[\\x{0590}-\\x{05ff}\\x{0600}-\\x{06ff}]/u", "<p class=\"foo\">", $str, 1);

Use a combination of SimpleXML, XPath and regular expressions (regex on text(), etc. are only supported as of XPath 2.0).
The steps:
Load the DOM first
Get all p tags via an xpath query
If the text / node value matches your regex, apply a css class
This is the actual code:
<?php
$html = "<html><p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p></html>";
$xml = simplexml_load_string($html);
# query the dom for all p tags
$ptags = $xml->xpath("//p");
# your regex
$regex = '~[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]~u';
# alternatively:
# $regex = '~\p{Arabic}~u';
# loop over the tags, if the regex matches, add another attribute
foreach ($ptags as &$p) {
if (preg_match($regex, (string) $p))
$p->addAttribute('class', 'some cool css class');
}
# just to be sure the tags have been altered
echo $xml->asXML();
?>
See a demo on ideone.com. The code has the advantage that you only analyze the content of the p tag, not the DOM structure in general.

regex to replace mailto: hrefs but ignore site links

I need some help to tweak this regular expression:
$content = 'more test test Jeff this is a test';
$content = preg_replace("~<a .*?href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>~", "$1", $content);
This expression is to strip the html markup off a mailto link and just return the email (jeff#test.com)
It works fine except for in the example I gave above - because a unlimited number of whitespaces is allowed before the href in the pattern, when a website link is before the mailto link, the regex looks all the way forward until it finds the mailto: in the following link and removes all the content in between.
maybe a fix would be to just limit it to two or three whitespaces after the opening tag so as to not look so far ahead, but i wonder if there is a better solution from people who know regex better than I?

Here is what you should be using...
$dom = new DOMDocument;
$dom->loadHTML($content);
foreach($dom->getElementsByTagName('a') as $a) {
if ($a->hasAttribute('href')
AND strpos($href = trim($a->getAttribute('href')), 'mailto:') === 0) {
$textNode = $dom->createTextNode(substr($href, 7));
$parent = $a->parentNode;
$parent->insertBefore($textNode, $a);
$parent->removeChild($a);
}
}
CodePad.
$dom->saveHTML() adds all the HTML boiler plate stuff such as html and body element, you can remove them with...
$html = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
$html .= $dom->saveHTML($node);
}
CodePad.

The problem is not to allow any amount of whitespace, that would be working. The problem is you allow one space and any amount of ANY character with your <a .*
If you fix this and allow really only whitespace like this
<a\s+href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>
it seems to work.
See it here at Regexr
But probably you should have a closer look at alex answer (+1 for the example) as this would be the cleaner solution.

preg_replace only OUTSIDE tags ? (... we're not talking full 'html parsing', just a bit of markdown)

What is the easiest way of applying highlighting of some text excluding text within OCCASIONAL tags "<...>"?
CLARIFICATION: I want the existing tags PRESERVED!
$t =
preg_replace(
"/(markdown)/",
"<strong>$1</strong>",
"This is essentially plain text apart from a few html tags generated with some
simplified markdown rules: <a href=markdown.html>[see here]</a>");
Which should display as:
"This is essentially plain text apart from a few html tags generated with some simplified markdown rules: see here"
... BUT NOT MESS UP the text inside the anchor tag (i.e. <a href=markdown.html> ).
I've heard the arguments of not parsing html with regular expressions, but here we're talking essentially about plain text except for minimal parsing of some markdown code.

Actually, this seems to work ok:
<?php
$item="markdown";
$t="This is essentially plain text apart from a few html tags generated
with some simplified markdown rules: <a href=markdown.html>[see here]</a>";
//_____1. apply emphasis_____
$t = preg_replace("|($item)|","<strong>$1</strong>",$t);
// "This is essentially plain text apart from a few html tags generated
// with some simplified <strong>markdown</strong> rules: <a href=
// <strong>markdown</strong>.html>[see here]</a>"
//_____2. remove emphasis if WITHIN opening and closing tag____
$t = preg_replace("|(<[^>]+?)(<strong>($item)</strong>)([^<]+?>)|","$1$3$4",$t);
// this preserves the text before ($1), after ($4)
// and inside <strong>..</strong> ($2), but without the tags ($3)
// "This is essentially plain text apart from a few html tags generated
// with some simplified <strong>markdown</strong> rules: <a href=markdown.html>
// [see here]</a>"
?>
A string like $item="odd|string" would cause some problems, but I won't be using that kind of string anyway... (probably needs htmlentities(...) or the like...)

You could split the string into tag‍/‍no-tag parts using preg_split:
$parts = preg_split('/(<(?:[^"\'>]|"[^"<]*"|\'[^\'<]*\')*>)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
Then you can iterate the parts while skipping every even part (i.e. the tag parts) and apply your replacement on it:
for ($i=0, $n=count($parts); $i<$n; $i+=2) {
$parts[$i] = preg_replace("/(markdown)/", "<strong>$1</strong>", $parts[$i]);
}
At the end put everything back together with implode:
$str = implode('', $parts);
But note that this is really not the best solution. You should better use a proper HTML parser like PHP’s DOM library. See for example these related questions:
Highlight keywords in a paragraph
Regex / DOMDocument - match and replace text not in a link

First replace any string after a tag, but force your string is after a tag:
$t=preg_replace("|(>[^<]*)(markdown)|i",'$1<strong>$2</strong>',"<null>$t");
Then delete your forced tag:
$show=preg_replace("|<null>|",'',$show);

You could split your string into an array at every '<' or '>' using preg_split(), then loop through that array and replace only in entries not beginning with an '>'. Afterwards you combine your array to an string using implode().

This regex should strip all HTML opening and closing tags: /(<[.*?]>)+/
You can use it with preg_replace like this:
$test = "Hello <strong>World!</strong>";
$regex = "/(<.*?>)+/";
$result = preg_replace($regex,"",$test);

actually this is not very efficient, but it worked for me
$your_string = '...';
$search = 'markdown';
$left = '<strong>';
$right = '</strong>';
$left_Q = preg_quote($left, '#');
$right_Q = preg_quote($right, '#');
$search_Q = preg_quote($search, '#');
while(preg_match('#(>|^)[^<]*(?<!'.$left_Q.')'.$search_Q.'(?!'.$right_Q.')[^>]*(<|$)#isU', $your_string))
$your_string = preg_replace('#(^[^<]*|>[^<]*)(?<!'.$left_Q.')('.$search_Q.')(?!'.$right_Q.')([^>]*<|[^>]*$)#isU', '${1}'.$left.'${2}'.$right.'${3}', $your_string);
echo $your_string;

Using regex to remove HTML tags

I need to convert
$text = 'We had <i>fun</i>. Look at this photo of Joe';
[Edit] There could be multiple links in the text.
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.

First do a preg_replace to keep the link. You could use:
preg_replace('(.*?)', '$\2 ($\1)', $str);
Then use strip_tags which will finish off the rest of the tags.

try an xml parser to replace any tag with it's inner html and the a tags with its href attribute.
http://www.php.net/manual/en/book.domxml.php

The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[#href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $node) {
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load any HTML into a DomDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, and gets all links with an href attribute. It then creates a text node element from the innerHTML and the href attribute and replaces the link. The second version just uses the DOM API and no Xpath.
Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.

I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = replace($text, "<i>", "");
$text = replace($text, "</i>", "");
(My php is really rusty, so replace may not be the right function name -- but the idea is what I'm sharing.)
The <a> tag is a bit more tricky. But, it can be done. You need to find the point that <a starts and that the > ends with. Then you extract the entire length and replace the closing </a>
That might go something like:
$start = strrpos( $text, "<a" );
$end = strrpos( $text, "</a>", $start );
$text = substr( $text, $start, $end );
$text = replace($text, "</a>", "");
(I don't know if this will work, again the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment)
Reference:
strrpos - http://www.php.net/manual/en/function.strrpos.php
replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php

It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at this photo of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the code you want in your test case.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

How can I exclude regex href matches of a particular domain? - php

Related

Regex to find anchor tag not working accurately

preg_match all paragraphs in a string

regex to replace mailto: hrefs but ignore site links

preg_replace only OUTSIDE tags ? (... we're not talking full 'html parsing', just a bit of markdown)

Using regex to remove HTML tags

Categories

Resources