This question already has answers here:
php regex to match outside of html tags
(4 answers)
Closed 3 years ago.
I'm working on some code to replace words in WordPress content with links. For example, the word "example" needs to be replaced with a text link whose link text is "example".
I've got this working with the following code:
function word_replace($text){
$site = esc_url( home_url() );
$replace = array(
'example' => 'example',
'word' => 'word',
);
$text = str_replace(array_keys($replace), $replace, $text);
return $text;
}
The only issue is that words inside an href="" attribute also get replaced, and this breaks the HTML. How do I prevent words from being replaced inside an href="" or class="" attribute? What regex do I need to skip these attributes? A piece of example code would be a big help :-)
Try this:
$site = 'http://example.com';
// any markup that contains an <a> element
$html = '<a href="old.html">Link</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('a');
$node = $nodes->item(0);
$node->setAttribute('href', 'page.html');
echo $dom->saveHTML($node);
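To keep replacements out of href="" and class="" attributes, one option is to touch only the text nodes of the parsed document. The following is a minimal sketch, not a drop-in WordPress filter: the function name link_words_in_text, the sample markup and the target URL are assumptions made for illustration. It wraps the content in a div so the parser has a single root, skips text that is already inside a link, and serializes only the children of that wrapper.

```php
<?php
// Hypothetical helper: turn the given words into links, in text nodes only.
function link_words_in_text(string $html, array $links): string
{
    $dom = new DOMDocument();
    // The wrapper div gives the parser a single root element.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);

    // One alternation of all words, matched on word boundaries.
    $words = array_map(fn($w) => preg_quote($w, '/'), array_keys($links));
    $pattern = '/\b(' . implode('|', $words) . ')\b/';

    // Attributes are never visited; text already inside a link is skipped.
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $text) {
        $parts = preg_split($pattern, $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no word found in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                // Odd indexes hold the captured words.
                $a = $dom->createElement('a');
                $a->setAttribute('href', $links[$part]);
                $a->appendChild($dom->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    // Serialize the children of the wrapper div, not the div itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Sample input: the class attribute and the existing link both contain the word.
$sample = '<p class="example">An example <a href="/example">example</a></p>';
echo link_words_in_text($sample, ['example' => 'http://example.com/example']);
```

Only the word in the running text is expected to become a link here; the class attribute and the existing anchor should come through unchanged.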
This question already has answers here:
Regex select all text between tags
(23 answers)
Closed 2 years ago.
I'm trying to write a function that finds every substring of a string that is a given HTML element, for example
<li>
But my regular expression doesn't work and I can't find my mistake.
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
$items = preg_match_all('/(<li>\w+<\/li>)', $str, $matches);
$items must be an array of the desired substrings
Consider using DOMDocument to parse and manipulate HTML or XML tags. Do not reinvent the wheel with Regex.
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
$dom = new DOMDocument();
$dom->loadHTML($str);
$li = $dom->getElementsByTagName('li');
$value = $li->item(0)->nodeValue;
echo $value;
' hello'
Or if you want to iterate over all
foreach($li as $item)
echo $item->nodeValue, PHP_EOL;
' hello'
'how are you?'
Markus' answer is correct, but in case you just want the quick and dirty regex version, here it is:
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';
preg_match_all('/(<li>.+<\/li>)/U', $str, $items);
The U modifier makes the quantifiers ungreedy.
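An equivalent way to write this, shown here only as a sketch, is to make the quantifier itself lazy with .+? and to move the capture group inside the tags, so that both the full elements and their inner text are available. The variable names $items and $texts are chosen for illustration.

```php
<?php
$str = 'hello brbrbr <li> hello</li> <li>how are you?</li>';

// .+? is lazy on its own, so the U modifier is not needed.
// The capture group holds only the text between the tags.
preg_match_all('/<li>(.+?)<\/li>/', $str, $matches);

$items = $matches[0]; // full matches, tags included
$texts = $matches[1]; // inner text only

print_r($items);
print_r($texts);
```

Here $items holds '<li> hello</li>' and '<li>how are you?</li>', while $texts holds ' hello' and 'how are you?'.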
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Replace URLs in text with HTML links
(17 answers)
Closed 2 years ago.
I am using the following regex to replace plain URLs with html links in a text:
preg_replace('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', '<a href="$1">$1</a> ', $text_msg);
Now I want to modify the regex so that it replaces the URL only if there is no double quote directly before it, meaning the URL is not part of a tag (i.e. the URL is at the start of the string, at the start of a line, or after a space).
Examples:
This is the link <a href="http://test.com"> ... (URL should not be replaced)
http://test.com (at the beginning of a line or of the whole multi-line string; URL should be replaced)
This is the site: http://test.com (URL should be replaced)
Thanks.
Your question actually breaks down into two smaller problems. You've already solved one of them, which is parsing the URL with a regular expression. The second part is extracting text from HTML, which isn't easily solved by a regular expression at all. The confusion comes from trying to do both at the same time with a single regular expression (parsing HTML and parsing the URL). See the parsing HTML with regex SO Answer for more details on why this is a bad idea.
So instead, let's just use an HTML parser (like DOMDocument) to extract text nodes from the HTML and parse URLs inside those text nodes.
Here's an example
<?php
$html = <<<'HTML'
<p>This is a URL http://abcd/ims in text</p>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Let's walk the entire DOM tree looking for text nodes
function walk(DOMNode $node, $skipParent = false) {
if (!$skipParent) {
yield $node;
}
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $n) {
yield from walk($n);
}
}
}
foreach (walk($dom->firstChild) as $node) {
if ($node instanceof DOMText) {
// lets find any links and change them to HTML
if (preg_match('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', $node->nodeValue, $match)) {
$node->nodeValue = preg_replace('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', "\xff ",
$node->nodeValue);
$nodeSplit = explode("\xff", $node->nodeValue, 2);
$node->nodeValue = $nodeSplit[1];
$newNode = $dom->createTextNode($nodeSplit[0]);
$href = $dom->createElement('a', $match[1]);
$href->setAttribute('href', $match[1]);
$node->parentNode->insertBefore($newNode, $node);
$node->parentNode->insertBefore($href, $node);
}
}
}
echo $dom->saveHTML();
Which gives you the desired HTML as output:
<p>This is a URL <a href="http://abcd/ims">http://abcd/ims</a> in text</p>
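If a text node can contain several URLs, a variant built on preg_split with PREG_SPLIT_DELIM_CAPTURE may be easier to reason about. The sketch below is an assumption-laden illustration rather than a replacement for the code above: the function name linkify_text_nodes is made up, the URL pattern is a simplified form of the one in the question, and the input is wrapped in a div that is dropped again on output.

```php
<?php
// Hypothetical helper: link every plain URL found in text nodes.
function linkify_text_nodes(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);

    // URLs in attributes are never seen; text inside existing links is skipped.
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $text) {
        $parts = preg_split('/(https?:\/\/\S{4,})/i', $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                // Odd indexes hold the captured URLs.
                $a = $dom->createElement('a');
                $a->setAttribute('href', $part);
                $a->appendChild($dom->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    // Return the content of the wrapper div only.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo linkify_text_nodes(
    '<p>See http://test.com and <a href="http://test.com">this</a> or https://example.org/x now</p>'
);
```

The existing anchor is left alone, and the two plain URLs around it are each wrapped in their own link.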
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
This script is not mine; I am trying to modify it. It searches for all <a> tags and removes them. How would I modify the code to remove only the tags that point to a given domain or URL? For example, to remove the tags for the domain www.domainurl.com, given this input:
fsdf
<a title="Google Adsense" href="https://www.domainurl.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">fgddf</a>
domain
<a title="Google Adsense" href="https://www.googlead.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">googled</a>
result would look like this:
fsdf
fgddf
domain
<a title="Google Adsense" href="https://www.googlead.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">googled</a>
This is the code :
if (in_array ( 'OPT_STRIP', $camp_opt )) {
echo '<br>Striping links ';
//$abcont = strip_tags ( $abcont, '<p><img><b><strong><br><iframe><embed><table><del><i><div>' );
preg_match_all('{<a.*?>(.*?)</a>}' , $abcont , $allLinksMatchs);
$allLinksTexts = $allLinksMatchs[1];
$allLinksMatchs=$allLinksMatchs[0];
$j = 0;
foreach ($allLinksMatchs as $singleLink){
if(! stristr($singleLink, 'twitter.com'))
$abcont = str_replace($singleLink, $allLinksTexts[$j], $abcont);
$j++;
}
}
I tried doing this but it did not work for me:
Regex :
Specifying in the search with preg_match_all
preg_match_all('{<a.*?[^>]* href="((https?:\/\/)?([\w\-])+\.{1}domainurl\.([a-z]{2,6})([\/\w\.-]*)*\/?)">(.*?)</a>}' , $abcont , $allLinksMatchs);
Any ideas? Thank you very much.
Rather than try and parse HTML with regular expressions, as you suggested, I have chosen to use the DOMDocument class instead.
function remove_domain($str, $domainsToRemove)
{
$domainsToRemove = is_array($domainsToRemove) ? $domainsToRemove : array_slice(func_get_args(), 1);
$dom = new DOMDocument;
$dom->loadHTML("<div>{$str}</div>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$anchors = $dom->getElementsByTagName('a');
// Code taken and modified from: http://php.net/manual/en/domnode.replacechild.php#50500
$i = $anchors->length - 1;
while ($i > -1) {
$anchor = $anchors->item($i);
foreach ($domainsToRemove as $domain) {
if (strpos($anchor->getAttribute('href'), $domain) !== false) {
// $new = $dom->createElement('p', $anchor->textContent);
$new = $dom->createTextNode($anchor->textContent);
$anchor->parentNode->replaceChild($new, $anchor);
// The anchor is detached now, so stop checking further domains for it
break;
}
}
$i--;
}
// Create HTML string, then remove the wrapping div.
$html = $dom->saveHTML();
$html = substr($html, 5, strlen($html) - (strlen('</div>') + 1) - strlen('<div>'));
return $html;
}
You can then use the above code in the following examples.
Notice how you can either pass in a single domain as a string, pass an array of domains, or take advantage of func_get_args and pass in any number of parameters.
$str = <<<str
fsdf
<a title="Google Adsense" href="https://www.domainurl.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">fgddf</a>
domain
<a title="Google Adsense" href="https://www.googlead.com/refer/google-adsense/" target="_blank" rel="nofollow noopener">googled</a>
str;
// Example usage
remove_domain($str, 'domainurl.com');
remove_domain($str, 'domainurl.com', 'googlead.com');
remove_domain($str, ['domainurl.com', 'googlead.com']);
Firstly, I have stored your string in a variable, but that is just so that I could utilize it for the answer; replace $str with wherever you get that code from.
The loadHTML function takes an HTML string, but with the LIBXML_HTML_NOIMPLIED flag it expects a single root element, which is why I have wrapped the string in a div.
The while loop will iterate over the anchor elements, and then replace any that match a specified domain with just the content of the anchor tags.
Note, I have left in a comment above this line which you can use instead. This will replace the anchor element with a p tag, which will have a default style of display: block; meaning that your layout won't be likely to break. However, since your expected output is just text nodes, I have left this as just an option.
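The domain check can also be pushed into an XPath expression instead of looping over every anchor. This is only a sketch of that alternative: the sample markup is shortened from the question, the domain is hard-coded for brevity, and contains() performs a plain substring test on the href value, just like strpos above.

```php
<?php
$html = 'fsdf <a href="https://www.domainurl.com/refer/">fgddf</a> domain '
      . '<a href="https://www.googlead.com/refer/">googled</a>';

$dom = new DOMDocument();
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Select only the anchors whose href contains the domain,
// then replace each one with a text node holding its link text.
foreach ($xpath->query('//a[contains(@href, "domainurl.com")]') as $anchor) {
    $anchor->parentNode->replaceChild($dom->createTextNode($anchor->textContent), $anchor);
}

// Serialize the children of the wrapper div, not the div itself.
$result = '';
foreach ($dom->documentElement->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}

echo $result;
```

Serializing the child nodes one by one avoids the substr arithmetic needed to cut off the wrapping div.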
What about:
<a.*? href=\".*www\.googlead\.com.*\">(.*?)<\/a>
So it becomes:
preg_match_all('{<a.*? href=\".*www\.googlead\.com.*\">(.*?)<\/a>}' , $abcont , $allLinksMatchs);
With this pattern, only <a> tags that point to www.googlead.com are matched and removed.
The following assumes your HTML is stored in a variable.
Using preg_replace should be a better option. Here's a function that should help you a bit:
function removeLinkTagsOfDomain($html, $domain) {
// Escape all regex special characters, including the / delimiter
$domain = preg_quote($domain, '/');
// Search for <a> tags with a href attribute containing the specified domain.
// [^>]* and [^"]* keep the match inside one tag, and (.+?) stops at the first </a>
$pattern = '/<a [^>]*href="[^"]*' . $domain . '[^"]*"[^>]*>(.+?)<\/a>/';
// Final replacement (should be the text node of <a> tags)
$replacer = '$1';
return preg_replace($pattern, $replacer, $html);
}
// Usage:
$domains = [...];
$html = '...';
foreach ($domains as $d) {
$html = removeLinkTagsOfDomain($html, $d);
}
This question already has answers here:
PHP parse/syntax errors; and how to solve them
(20 answers)
Closed 6 years ago.
I want to extract data from a web page, but I am getting an error in preg_match:
<?php
$html=file_get_contents("https://www.instagram.com/p/BJz4_yijmdJ/?taken-by=the.witty");
preg_match("("instapp:owner_user_id" content="(.*)")", $html, $match);
$title = $match[1];
echo $title;
?>
This is the error i get
Parse error: syntax error, unexpected 'instapp' (T_STRING) in
/home/ubuntu/workspace/test.php on line 4
Please help me; how can I do this? I also want to extract more data from the page with regex. Is it possible to extract everything at once with a single piece of code, or do I have to call preg_match many times?
The main problem is that you did not form a valid string literal. Note that PHP supports both single- and double-quoted string literals, and you may use that to your advantage:
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $match);
While it is OK to use paired (...) symbols as regex delimiters, I'd suggest using a more conventional delimiter such as /, ~ or #.
Also, (.*) is too generic a pattern and may match more than you need, since . also matches " and * is a greedy quantifier. A negated character class is better: ([^"]*) matches zero or more characters other than ".
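The difference between the two patterns shows up as soon as a second attribute follows on the same line. The snippet below is a small illustration with made-up markup and a made-up ID value of 12345; it is not taken from the real page.

```php
<?php
$html = '<meta property="instapp:owner_user_id" content="12345" />'
      . '<meta property="og:title" content="Hello" />';

// Greedy: .* runs on to the last double quote on the line.
preg_match('~"instapp:owner_user_id" content="(.*)"~', $html, $greedy);

// Negated class: stops at the first double quote after the value.
preg_match('~"instapp:owner_user_id" content="([^"]*)"~', $html, $exact);

var_dump($greedy[1]);
var_dump($exact[1]);
```

The greedy capture swallows everything up to the quote that closes the second content attribute, while the negated class returns just 12345.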
However, to parse HTML in PHP, you may use a DOM parser, like DOMDocument.
Here is a sample that gets all meta tags that have a content attribute, extracts the value of that attribute, and saves the values in an array:
$html = "<html><head><meta property=\"al:ios:url\" content=\"instagram://media?id=1329656989202933577\" /></head><body><span/></body></html>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@content]');
$res = array();
foreach($metas as $m) {
array_push($res, $m->getAttribute('content'));
}
print_r($res);
And to only get the id in the content attribute value of a meta tag whose property attribute is equal to al:ios:url, use
$xpath = new DOMXPath($dom);
$metas = $xpath->query('//meta[@property="al:ios:url"]');
$id = "";
if (preg_match('~[?&]id=(\d+)~', $metas->item(0)->getAttribute('content'), $match))
{
$id = $match[1];
}
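As for extracting several values at once, a single XPath query can collect every property/content pair into an associative array, so no repeated preg_match calls are needed. The markup and the owner ID value below are invented for the example; real pages may expose different meta tags.

```php
<?php
$html = '<html><head>'
      . '<meta property="al:ios:url" content="instagram://media?id=1329656989202933577" />'
      . '<meta property="instapp:owner_user_id" content="12345" />'
      . '</head><body></body></html>';

$dom = new DOMDocument();
// Real pages often contain markup that libxml warns about; collect the warnings quietly.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Build a property => content map from all meta tags that carry both attributes.
$meta = [];
foreach ($xpath->query('//meta[@property and @content]') as $m) {
    $meta[$m->getAttribute('property')] = $m->getAttribute('content');
}

print_r($meta);
```

Individual values can then be read by key, for example $meta['instapp:owner_user_id'].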
This question already has answers here:
replace all "foo" between ()
(3 answers)
Closed 7 years ago.
I would like to replace all \n inside of <pre></pre> with a placeholder. This is what I created:
<?php
$html = "<div>\n<pre id=foo>Foo\n\nBar Bar\nFoo Foo</pre>\n\n</div>";
echo preg_replace("/(<pre[^>]*>[^<]*)(\n)([^<]*<\/pre)/", "$1{NEWLINE}$3", $html);
?>
It replaces only one \n as expected. Do I need to use preg_replace_callback() and a separate function to replace the linebreaks or is it possible with one regex alone?
EDIT: Any solution available for this, too?
$html2 = "<div>\n<pre id=foo><b>Foo\n\n</b>Bar Bar\nFoo Foo</pre>\n\n</div>";
You can do this using a callback as you suggested.
$html = preg_replace_callback('~<pre[^>]*>\K.*?(?=</pre>)~si',
function($m) {
return str_replace(array("\r\n", "\n", "\r"), '{NEWLINE}', $m[0]);
}, $html);
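The same callback should also cover the $html2 case from the edit, because it replaces newlines in everything between the opening and closing pre tags, nested markup included. A quick check, using the $html2 string from the question:

```php
<?php
$html2 = "<div>\n<pre id=foo><b>Foo\n\n</b>Bar Bar\nFoo Foo</pre>\n\n</div>";

// \K drops the opening tag from the match, so $m[0] is only the content of <pre>.
$result = preg_replace_callback(
    '~<pre[^>]*>\K.*?(?=</pre>)~si',
    function ($m) {
        return str_replace(array("\r\n", "\n", "\r"), '{NEWLINE}', $m[0]);
    },
    $html2
);

echo $result;
```

The three newlines inside the pre element become {NEWLINE}, and the newlines outside it stay as they are.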
Although, I would recommend using DOM to perform this task.
$doc = new DOMDocument;
@$doc->loadHTML($html); // load the HTML
$nodes = $doc->getElementsByTagName('pre');
$find = array("\r\n", "\n", "\r");
foreach ($nodes as $node) {
$node->nodeValue = str_replace($find, '{NEWLINE}', $node->nodeValue);
}
echo $doc->saveHTML();
My question is a duplicate of:
https://stackoverflow.com/a/5756032/318765
This is what I need:
<?php
echo preg_replace("/(\r\n|\n\r|\n|\r)(?=[^<>]*<\/pre)/", "{NEWLINE}", $html);
?>