How to wrap every word in spans with PHP? - php

I have some HTML paragraphs and I want to wrap every word in a <span>. Currently I have:
$paragraph = "This is a paragraph.";
$contents = explode(' ', $paragraph);
$i = 0;
$span_content = '';
foreach ($contents as $c){
$span_content .= '<span>'.$c.'</span> ';
$i++;
}
$result = $span_content;
The above code works fine for normal cases, but sometimes $paragraph contains HTML tags, for example:
$paragraph = "This is an image: <img src='/img.jpeg' /> This is a <a href='/abc.htm'/>Link</a>'";
How can I avoid wrapping the "words" inside HTML tags, so that the tags still work while the other words are wrapped in spans? Thanks a lot!

Some (*SKIP)(*FAIL) mechanism?
<?php
$content = "This is an image: <img src='/img.jpeg' /> ";
$content .= "This is a <a href='/abc.htm'/>Link</a>";
$regex = '~<[^>]+>(*SKIP)(*FAIL)|\b\w+\b~';
$wrapped_content = preg_replace($regex, "<span>\\0</span>", $content);
echo $wrapped_content;
See a demo on ideone.com as well as on regex101.com.
To leave out the Link as well, you could go for:
(?:<[^>]+> # same pattern as above
| # or
(?<=>)\w+(?=<) # lookarounds with a word
)
(*SKIP)(*FAIL) # all of these alternatives shall fail
|
(\b\w+\b)
See a demo for this on regex101.com.
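The extended pattern above is only shown as a bare regex, so here is a rough sketch of how it might be plugged into preg_replace with the x (free-spacing) flag. The sample string and the variable names are my own assumptions, not part of the answer:

```php
<?php
// Sketch only: the sample $content below is a made-up illustration.
$regex = '~
    (?:<[^>]+>            # any complete tag
      |                   # or
      (?<=>)\w+(?=<)      # a word sitting directly between two tags
    )
    (*SKIP)(*FAIL)        # throw these alternatives away
    |
    (\b\w+\b)             # every remaining word ends up in group 1
~x';

$content = "This is a <a href='/abc.htm'>Link</a> here";
$wrapped = preg_replace($regex, '<span>$1</span>', $content);
echo $wrapped, "\n";
```

Because the skipped alternatives never produce a match, only group 1 is ever replaced, so the tag and the link text are left as they were.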

The short version is you really do not want to attempt this.
The longer version: if you are dealing with HTML then you need an HTML parser; you can't rely on regexes. Where it becomes even messier is that you are not starting with an HTML document but with an HTML fragment, which may or may not be well-formed. Hence you need to use an HTML parser to identify the text (non-markup) extents, separate them out and feed them into a secondary parser (which might well use regexes) for translation, then put the translated content back into the DOM before serializing the document.
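For what it's worth, the parser-based route described above could be sketched roughly like this with PHP's DOMDocument. The function name wrap_words_in_spans, the wrapper div with id "wrap-root" and the UTF-8 encoding hint are all assumptions of mine, and the output is normalised by the parser (for example attribute quotes), so treat it as a starting point rather than a finished solution:

```php
<?php
// Sketch: wrap every word found in text nodes in <span>, leaving the tags alone.
// wrap_words_in_spans and the "wrap-root" wrapper are hypothetical names.
function wrap_words_in_spans(string $fragment): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // fragments are often not well-formed
    $dom->loadHTML('<?xml encoding="UTF-8"><div id="wrap-root">' . $fragment . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $root = $xpath->query('//div[@id="wrap-root"]')->item(0);

    // Take a snapshot of the text nodes, because the tree is modified below.
    $textNodes = iterator_to_array($xpath->query('.//text()', $root));

    foreach ($textNodes as $textNode) {
        // Split into words and whitespace runs, keeping the whitespace.
        $pieces = preg_split('/(\s+)/', $textNode->nodeValue, -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        $replacement = $dom->createDocumentFragment();
        foreach ($pieces as $piece) {
            if (trim($piece) === '') {
                $replacement->appendChild($dom->createTextNode($piece));
            } else {
                $span = $dom->createElement('span');
                $span->appendChild($dom->createTextNode($piece));
                $replacement->appendChild($span);
            }
        }
        $textNode->parentNode->replaceChild($replacement, $textNode);
    }

    // Serialize only the children of the wrapper, not the added html/body boilerplate.
    $html = '';
    foreach ($root->childNodes as $child) {
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

echo wrap_words_in_spans("This is a <a href='/abc.htm'>Link</a>"), "\n";
```

Here the "secondary parser" is just a whitespace split; anything smarter (punctuation handling and so on) would go in the inner loop.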

Related

How to preg_match_all to get the text inside the tags "<h3>" and "<h3> <a/> </h3>"

Hello, I am currently creating an automatic table of contents for my WordPress site. My reference is
https://webdeasy.de/en/wordpress-table-of-contents-without-plugin/
Problem:
Everything goes well unless an <h3> tag contains an <a> link. In that case the $names result is missing.
I think the problem is caused by this regex section:
preg_match_all("/<h[3,4](?:\sid=\"(.*)\")?(?:.*)?>(.*)<\/h[3,4]>/", $content, $matches);
// get text under <h3> or <h4> tag.
$names = $matches[2];
I have tried modifying the regex (I don't really understand it):
preg_match_all("/<h[3,4](?:\sid=\"(.*)\")?(?:.*)?><a(.*)>(.*)<\/a><\/h[3,4]>/", $content, $matches);
// get text under <a> tag.
$names = $matches[4];
The code above works for finding the text inside <h3><a>a text</a></h3>, but an h3 tag that doesn't contain an <a> tag is then a problem.
My question:
How can I combine the two snippets above?
My expectation is that if the first snippet gives no result, the second one is executed instead.
Or maybe there is a better solution? Thank you.
Here's a way that will remove any tags inside of header tags
$html = <<<EOT
<h3>Here's an alternative solution</h3> to using regex. <h3>It may <a name='#thing'>not</a></h3> be the most elegant solution, but it works
EOT;
preg_match_all('#<h(.*?)>(.*?)<\/h(.*?)>#si', $html, $matches);
foreach ($matches[0] as $num=>$blah) {
$look_for = preg_quote($matches[0][$num],"/");
$tag = str_replace("<","",explode(">",$matches[0][$num])[0]);
$replace_with = "<$tag>" . strip_tags($matches[2][$num]) . "</$tag>";
$html = preg_replace("/$look_for/", $replace_with,$html,1);
}
echo "<pre>$html</pre>";
The answer by @kinglish is the basis of this solution, thank you very much. I slightly modified and simplified it to fit the article linked in my question. This code worked for me:
preg_match_all('#(\<h[3-4])\sid=\"(.*?)\"?\>(.*?)(<\/h[3-4]>)#si',$content, $matches);
$tags = $matches[0];
$ids = $matches[2];
$raw_names = $matches[3];
/* Clean $rawnames from other html tags */
$clean_names= array_map(function($v){
return trim(strip_tags($v));
}, $raw_names);
$names = $clean_names;
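As an alternative to the regex, a DOM-based sketch could look roughly like the following. The function name heading_names and the shape of the returned array are my own assumptions; it reads the id and the plain text of each h3/h4, whether or not the heading contains an <a>:

```php
<?php
// Sketch: collect tag name, id and plain text of every <h3>/<h4>.
// heading_names is a hypothetical helper, not part of the linked article.
function heading_names(string $content): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy post markup
    $dom->loadHTML('<?xml encoding="UTF-8">' . $content);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $headings = [];
    foreach ($xpath->query('//h3 | //h4') as $heading) {
        $headings[] = [
            'tag'  => $heading->nodeName,
            'id'   => $heading->getAttribute('id'),   // '' when there is no id
            'name' => trim($heading->textContent),    // text without inner tags
        ];
    }
    return $headings;
}

$toc = heading_names('<h3 id="a">Plain</h3><p>x</p><h3 id="b"><a href="#">Linked</a> title</h3><h4>Sub</h4>');
print_r($toc);
```

textContent already drops the inner tags, so no strip_tags step is needed.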

How can I exclude regex href matches of a particular domain?

How can I exclude href matches for a domain (ex. one.com)?
My current code:
$str = 'This string has <a href="http://one.com">one link</a> and <a href="http://two.com">another link</a>';
$str = preg_replace('~<a href="(https?://[^"]+)".*?>.*?</a>~', '$1', $str);
echo $str; // This string has http://one.com and http://two.com
Desired result:
This string has one link and http://two.com
Using a regular expression
If you're going to use a regular expression to accomplish this task, you can use a negative lookahead. It basically asserts that the part // in the href attribute is not followed by one.com. It's important to note that a lookaround assertion doesn't consume any characters.
Here's what the regular expression would look like:
<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
Using a DOM parser
Even though this is a pretty simple task, the correct way to achieve this would be using a DOM parser. That way, you wouldn't have to change the regex if the format of your markup changes in future. The regex solution will break if the <a> node contains more attribute values. To fix all those issues, you can use a DOM parser such as PHP's DOMDocument to handle the parsing:
Here's what the solution would look like:
$dom = new DOMDocument();
$dom->loadHTML($html); // $html is the string containing markup
$links = $dom->getElementsByTagName('a');
// Loop backwards through the links and replace each one (except one.com) with its URL
for ($i = $links->length - 1; $i >= 0; $i--) {
$node = $links->item($i);
$href = $node->getAttribute('href');
if ($href !== 'http://one.com') {
$newTextNode = $dom->createTextNode($href);
$node->parentNode->replaceChild($newTextNode, $node);
}
}
echo $dom->saveHTML();
This should do it:
<a href="(https?://(?!one\.com)[^"]+)".*?>.*?</a>
We use a negative lookahead to make sure that one.com does not appear directly after the https?://.
If you also want to check for some subdomains of one.com, use this example:
<a href="(https?://(?!((www|example)\.)?one\.com)[^"]+)".*?>.*?</a>
Here we optionally check for www. or example. before one.com. This will still allow a URL like misc.one.com, though. If you want to exclude all subdomains of one.com, use this:
<a href="(https?://(?!([^.]+\.)?one\.com)[^"]+)".*?>.*?</a>

regex to replace mailto: hrefs but ignore site links

I need some help to tweak this regular expression:
$content = 'more test test Jeff this is a test';
$content = preg_replace("~<a .*?href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>~", "$1", $content);
This expression is meant to strip the HTML markup off a mailto link and return just the email address (jeff@test.com).
It works fine except in the example I gave above. Because an unlimited number of characters is allowed before the href in the pattern, when a website link comes before the mailto link, the regex looks all the way forward until it finds the mailto: in the following link and removes all the content in between.
Maybe a fix would be to limit it to two or three whitespace characters after the opening tag so it doesn't look so far ahead, but I wonder if there is a better solution from people who know regex better than I do?
Here is what you should be using...
$dom = new DOMDocument;
$dom->loadHTML($content);
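// Copy the live node list to an array first; removing nodes while iterating it directly can skip links
foreach(iterator_to_array($dom->getElementsByTagName('a')) as $a) {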
if ($a->hasAttribute('href')
AND strpos($href = trim($a->getAttribute('href')), 'mailto:') === 0) {
$textNode = $dom->createTextNode(substr($href, 7));
$parent = $a->parentNode;
$parent->insertBefore($textNode, $a);
$parent->removeChild($a);
}
}
$dom->saveHTML() adds all the HTML boilerplate, such as the html and body elements. You can remove it with...
$html = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
$html .= $dom->saveHTML($node);
}
The problem is not that you allow any amount of whitespace; that would work. The problem is that you allow one space followed by any number of ANY characters with your <a .*?
If you fix this and really allow only whitespace, like this:
<a\s+href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>
it seems to work.
See it here at Regexr
But you should probably have a closer look at alex's answer (+1 for the example), as that would be the cleaner solution.
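To make the difference between the two patterns concrete, here is a small comparison. The sample $content (including the example.com link) is an assumption of mine, since the markup in the question did not survive:

```php
<?php
// Sketch: hypothetical input with a site link followed by a mailto link.
$content = 'more test <a href="http://example.com">test</a> '
         . '<a href="mailto:jeff@test.com">Jeff</a> this is a test';

// Original pattern: ".*?" after "<a " may run across the first link.
$loose  = '~<a .*?href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>~';
// Tightened pattern: only whitespace is allowed between "<a" and "href".
$strict = '~<a\s+href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>~';

$loose_result  = preg_replace($loose, '$1', $content);
$strict_result = preg_replace($strict, '$1', $content);

echo $loose_result, "\n";   // the site link is swallowed as well
echo $strict_result, "\n";  // only the mailto link is replaced
```

Note that the tightened pattern assumes href is the first attribute of the tag, which is one more reason to prefer the DOM approach.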

preg_replace only OUTSIDE tags ? (... we're not talking full 'html parsing', just a bit of markdown)

What is the easiest way of applying highlighting of some text excluding text within OCCASIONAL tags "<...>"?
CLARIFICATION: I want the existing tags PRESERVED!
$t =
preg_replace(
"/(markdown)/",
"<strong>$1</strong>",
"This is essentially plain text apart from a few html tags generated with some
simplified markdown rules: <a href=markdown.html>[see here]</a>");
Which should display as:
"This is essentially plain text apart from a few html tags generated with some simplified markdown rules: see here"
... BUT NOT MESS UP the text inside the anchor tag (i.e. <a href=markdown.html> ).
I've heard the arguments of not parsing html with regular expressions, but here we're talking essentially about plain text except for minimal parsing of some markdown code.
Actually, this seems to work ok:
<?php
$item="markdown";
$t="This is essentially plain text apart from a few html tags generated
with some simplified markdown rules: <a href=markdown.html>[see here]</a>";
//_____1. apply emphasis_____
$t = preg_replace("|($item)|","<strong>$1</strong>",$t);
// "This is essentially plain text apart from a few html tags generated
// with some simplified <strong>markdown</strong> rules: <a href=
// <strong>markdown</strong>.html>[see here]</a>"
//_____2. remove emphasis if WITHIN opening and closing tag____
$t = preg_replace("|(<[^>]+?)(<strong>($item)</strong>)([^<]+?>)|","$1$3$4",$t);
// this preserves the text before ($1), after ($4)
// and inside <strong>..</strong> ($2), but without the tags ($3)
// "This is essentially plain text apart from a few html tags generated
// with some simplified <strong>markdown</strong> rules: <a href=markdown.html>
// [see here]</a>"
?>
A string like $item="odd|string" would cause some problems, but I won't be using that kind of string anyway... (probably needs htmlentities(...) or the like...)
You could split the string into tag/no-tag parts using preg_split:
$parts = preg_split('/(<(?:[^"\'>]|"[^"<]*"|\'[^\'<]*\')*>)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
Then you can iterate over the parts, skipping every second one (the odd-indexed entries, i.e. the tag parts), and apply your replacement to the rest:
for ($i=0, $n=count($parts); $i<$n; $i+=2) {
$parts[$i] = preg_replace("/(markdown)/", "<strong>$1</strong>", $parts[$i]);
}
At the end put everything back together with implode:
$str = implode('', $parts);
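Put together, the three steps could be wrapped up roughly like this. The function name highlight_outside_tags and the preg_quote call on the search term are my additions:

```php
<?php
// Sketch: highlight a term in the text parts only, never inside <...> tags.
// highlight_outside_tags is a hypothetical name.
function highlight_outside_tags(string $html, string $term): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates: text, tag, text, tag, ...
    $parts = preg_split(
        '/(<(?:[^"\'>]|"[^"<]*"|\'[^\'<]*\')*>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    // Quote the term so characters such as "|" are matched literally.
    $pattern = '/(' . preg_quote($term, '/') . ')/';

    // Even indexes hold the text between tags; odd indexes hold the tags.
    for ($i = 0, $n = count($parts); $i < $n; $i += 2) {
        $parts[$i] = preg_replace($pattern, '<strong>$1</strong>', $parts[$i]);
    }

    return implode('', $parts);
}

$highlighted = highlight_outside_tags(
    'simplified markdown rules: <a href=markdown.html>[see here]</a>',
    'markdown'
);
echo $highlighted, "\n";
```

The word inside the href is left alone because it lives in an odd-indexed (tag) part.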
But note that this is really not the best solution. You would be better off using a proper HTML parser like PHP's DOM library. See for example these related questions:
Highlight keywords in a paragraph
Regex / DOMDocument - match and replace text not in a link
First replace any occurrence that follows a tag, prepending a dummy tag so that the start of your string also counts as following a tag:
$show=preg_replace("|(>[^<]*)(markdown)|i",'$1<strong>$2</strong>',"<null>$t");
Then delete your forced tag:
$show=preg_replace("|<null>|",'',$show);
You could split your string into an array at every '<' or '>' using preg_split(), then loop through that array and replace only in entries not beginning with an '>'. Afterwards you combine your array into a string using implode().
This regex should strip all HTML opening and closing tags: /(<.*?>)+/
You can use it with preg_replace like this:
$test = "Hello <strong>World!</strong>";
$regex = "/(<.*?>)+/";
$result = preg_replace($regex,"",$test);
Actually, this is not very efficient, but it worked for me:
$your_string = '...';
$search = 'markdown';
$left = '<strong>';
$right = '</strong>';
$left_Q = preg_quote($left, '#');
$right_Q = preg_quote($right, '#');
$search_Q = preg_quote($search, '#');
while(preg_match('#(>|^)[^<]*(?<!'.$left_Q.')'.$search_Q.'(?!'.$right_Q.')[^>]*(<|$)#isU', $your_string))
$your_string = preg_replace('#(^[^<]*|>[^<]*)(?<!'.$left_Q.')('.$search_Q.')(?!'.$right_Q.')([^>]*<|[^>]*$)#isU', '${1}'.$left.'${2}'.$right.'${3}', $your_string);
echo $your_string;

Get content within a html tag using php and replace it after processing

I have an html (sample.html) like this:
<html>
<head>
</head>
<body>
<div id="content">
<!--content-->
<p>some content</p>
<!--content-->
</div>
</body>
</html>
How do I get the content part that is between the two HTML comments '<!--content-->' using PHP? I want to get it, do some processing and place it back, so I have to get and put! Is it possible?
esafwan - you could use a regular expression to extract the content of the div (with a certain id).
I've done this for image tags before, so the same rules apply. I'll dig out the code and update the message in a bit.
[update] try this:
<?php
function get_tag( $attr, $value, $xml ) {
$attr = preg_quote($attr, '/');
$value = preg_quote($value, '/');
$tag_regex = '/<div[^>]*'.$attr.'="'.$value.'">(.*?)<\\/div>/si';
preg_match($tag_regex,
$xml,
$matches);
// Return null instead of raising a notice when nothing matches
return $matches[1] ?? null;
}
$yourentirehtml = file_get_contents("test.html");
$extract = get_tag('id', 'content', $yourentirehtml);
echo $extract;
?>
or more simply:
preg_match("/<div[^>]*id=\"content\">(.*?)<\\/div>/si", $text, $match);
$content = $match[1];
jim
If this is a simple replacement that does not involve parsing the actual HTML document, you may use a regular expression or even just str_replace for this. But generally, it is not advisable to use regex for HTML, because HTML is not regular and coming up with reliable patterns can quickly become a nightmare.
The right way to parse HTML in PHP is to use a parsing library that actually knows how to make sense of HTML documents. Your best native bet would be DOM, but PHP has a number of other native XML extensions you can use, and there are also a number of third-party libraries like phpQuery, Zend_Dom, QueryPath and FluentDom.
If you use the search function, you will see that this topic has been covered extensively and you should have no problems finding examples that show how to solve your question.
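As a rough illustration of the DOM route, something like the following could pull out the inner HTML of the div with a given id. The function name inner_html_by_id is hypothetical, and the id is assumed to contain no double quotes:

```php
<?php
// Sketch: return the inner HTML of the first element with the given id, or null.
// inner_html_by_id is a hypothetical helper; $id must not contain double quotes.
function inner_html_by_id(string $html, string $id): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely perfectly valid
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $node = $xpath->query('//*[@id="' . $id . '"]')->item(0);
    if ($node === null) {
        return null;
    }

    // Serialize each child, so nested elements (including nested divs) are kept.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return $inner;
}

$page = '<html><body><div id="content"><!--content-->'
      . '<div class="inner"><p>some content</p></div>'
      . '<!--content--></div></body></html>';
echo inner_html_by_id($page, 'content'), "\n";
```

Because the parser tracks nesting, an inner div does not cut the result short the way a non-greedy regex would.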
<?php
$content=file_get_contents("sample.html");
$comment=explode("<!--content-->",$content);
$comment=explode("<!--content-->",$comment[1]);
var_dump(strip_tags($comment[0]));
?>
Check this, it will work for you.
The problem is with nested divs.
I found a solution here:
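The explode approach only gets the content. If you also need to put the processed result back, a sketch with preg_replace_callback might do. The function name process_between_markers and the strtoupper "processing" are placeholders of mine:

```php
<?php
// Sketch: run a callback on whatever sits between two marker comments
// and write the result back in place. process_between_markers is a hypothetical name.
function process_between_markers(string $html, callable $process, string $marker = '<!--content-->'): string
{
    $m = preg_quote($marker, '~');
    return preg_replace_callback(
        '~(' . $m . ')(.*?)(' . $m . ')~s',   // s flag: the content may span lines
        function (array $match) use ($process) {
            // Keep both markers, replace only what is between them.
            return $match[1] . $process($match[2]) . $match[3];
        },
        $html
    );
}

$sample = "<div id=\"content\"><!--content--><p>some content</p><!--content--></div>";
$processed = process_between_markers($sample, 'strtoupper');
echo $processed, "\n";
```

Since the markers are plain comments rather than nested tags, a regex is reasonably safe here.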
<?php // File: MatchAllDivMain.php
// Read html file to be processed into $data variable
$data = file_get_contents('test.html');
// Commented regex to extract contents from <div class="main">contents</div>
// where "contents" may contain nested <div>s.
// Regex uses PCRE's recursive (?1) sub expression syntax to recurse into group 1
$pattern_long = '{ # recursive regex to capture contents of "main" DIV
<div\s+class="main"\s*> # match the "main" class DIV opening tag
( # capture "main" DIV contents into $1
(?: # non-cap group for nesting * quantifier
(?: (?!<div[^>]*>|</div>). )++ # possessively match all non-DIV tag chars
| # or
<div[^>]*>(?1)</div> # recursively match nested <div>xyz</div>
)* # loop however deep as necessary
) # end group 1 capture
</div> # match the "main" class DIV closing tag
}six'; // single-line (dot matches all), ignore case and free spacing modes ON
// short version of same regex
$pattern_short = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';
$matchcount = preg_match_all($pattern_long, $data, $matches);
// $matchcount = preg_match_all($pattern_short, $data, $matches);
echo("<pre>\n");
if ($matchcount > 0) {
echo("$matchcount matches found.\n");
// print_r($matches);
for($i = 0; $i < $matchcount; $i++) {
echo("\nMatch #" . ($i + 1) . ":\n");
echo($matches[1][$i]); // print 1st capture group for match number i
}
} else {
echo('No matches');
}
echo("\n</pre>");
?>
Have a look here for a code example that shows how to load an HTML document into SimpleXML: http://blog.charlvn.com/2009/03/html-in-php-simplexml.html
You can then treat it as a normal SimpleXML object.
EDIT: This will only work if you want the content in a tag (e.g. between <div> and </div>)
