How can I replace all the anchors with their anchor text? My code is:
$body='<p>The man was dancing like a little boy while all kids were watching ... </p>';
I want the result to be:
<p>The man was dancing like a little boy while all kids were watching ... </p>
I used:
$body= preg_replace('#<a href="https?://(?:.+\.)?ok.co.*?>.*?</a>#i', '$1', $body);
and the result is:
<p>The man was while all kids were watching ... </p>
Try this
$body='<p>The man was dancing like a little boy while all kids were watching ... </p>';
echo preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $body);
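A note on why the original attempt dropped the text: its pattern has no capturing group, so '$1' expands to an empty string. Below is a minimal sketch that keeps a host restriction and captures the anchor text. The href value and the example.com host are assumptions for illustration, since the sample above does not show the link.

```php
<?php
// Hypothetical input: the href value is an assumption, not taken from the question.
$body = '<p>The man was <a href="http://www.example.com/x">dancing like a little boy</a> while all kids were watching ... </p>';

// [^>]* keeps the match inside one opening tag; (.*?) captures the anchor text.
$pattern = '#<a\s[^>]*href="https?://(?:[^"/]+\.)?example\.com[^"]*"[^>]*>(.*?)</a>#is';

// $1 now refers to the captured anchor text.
$result = preg_replace($pattern, '$1', $body);
echo $result;
```

Anchors pointing at other hosts are left untouched by this pattern.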
Without regexes.....
<?php
$d = new DOMDocument();
$d->loadHTML('<p>The man was dancing like a little boy while all kids were watching ... </p>');
$x = new DOMXPath($d);
foreach($x->query('//a') as $anchor){
    $url = $anchor->getAttribute('href');
    $domain = parse_url($url,PHP_URL_HOST);
    if($domain == 'www.example.com'){
        $anchor->parentNode->replaceChild(new DOMText($anchor->textContent),$anchor);
    }
}
function get_inner_html( $node ) {
    $innerHTML= '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML( $child );
    }
    return $innerHTML;
}
echo get_inner_html($x->query('//body')[0]);
You can use this code:
regex: /< a.*?>|<a.*?>|<\/a>/g (the g flag is regex101 notation; preg_replace() replaces all matches by default)
$body='<p>The man was dancing like a little boy while all kids were watching ... </p>';
echo preg_replace('/< a.*?>|<a.*?>|<\/a>/', ' ', $body);
You can test the pattern and see example matches here: https://regex101.com/r/mgYjoB/1
You could simply use strip_tags() and htmlspecialchars() here.
strip_tags - Strip HTML and PHP tags from a string
htmlspecialchars - Convert special characters to HTML entities
Step 1: Use strip_tags() to strip all tags except the <p> tag.
Step 2: To show the remaining <p> tags as literal text in the browser instead of having them rendered, pass the result through htmlspecialchars().
echo htmlspecialchars(strip_tags($body, '<p>'));
When there's already a built-in PHP function, I think it's better and more compact to use that instead of preg_replace().
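A minimal sketch of the two steps, assuming the body contains an anchor like the one below (the anchor and its URL are made up for illustration). Note that strip_tags() removes every anchor, not only those pointing at a particular domain.

```php
<?php
// Hypothetical input: the anchor and its URL are assumptions for illustration.
$body = '<p>The man was <a href="http://www.example.com/">dancing like a little boy</a> while all kids were watching ... </p>';

// Keep <p>, drop every other tag (including <a>) but leave the text in place.
$stripped = strip_tags($body, '<p>');
echo $stripped, "\n";

// Only needed when the markup itself should be shown as literal text in a page.
$escaped = htmlspecialchars($stripped);
echo $escaped, "\n";
```

If the result is going to be rendered as HTML rather than displayed as source, the htmlspecialchars() step can be skipped.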
Hello, I am currently creating an automatic table of contents for my WordPress site. My reference is:
https://webdeasy.de/en/wordpress-table-of-contents-without-plugin/
Problem:
Everything works well unless an <h3> tag contains an <a> link. In that case the $names result is missing.
I think the problem is caused by this regex section:
preg_match_all("/<h[3,4](?:\sid=\"(.*)\")?(?:.*)?>(.*)<\/h[3,4]>/", $content, $matches);
// get text under <h3> or <h4> tag.
$names = $matches[2];
I have tried modifying the regex (I don't really understand it):
preg_match_all("/<h[3,4](?:\sid=\"(.*)\")?(?:.*)?><a(.*)>(.*)<\/a><\/h[3,4]>/", $content, $matches);
// get text under <a> tag.
$names = $matches[4];
The code above works for finding the text in <h3><a>a text</a></h3>, but it fails for an h3 tag that doesn't contain an <a> tag.
My question:
How can I combine the two snippets above?
My expectation is that if the first pattern returns no result, the second one is used instead.
Or maybe there is a better solution? Thank you.
Here's a way that will remove any tags inside of header tags
$html = <<<EOT
<h3>Here's an alternative solution</h3> to using regex. <h3>It may <a name='#thing'>not</a></h3> be the most elegant solution, but it works
EOT;
preg_match_all('#<h(.*?)>(.*?)<\/h(.*?)>#si', $html, $matches);
foreach ($matches[0] as $num=>$blah) {
    $look_for = preg_quote($matches[0][$num],"/");
    $tag = str_replace("<","",explode(">",$matches[0][$num])[0]);
    $replace_with = "<$tag>" . strip_tags($matches[2][$num]) . "</$tag>";
    $html = preg_replace("/$look_for/", $replace_with,$html,1);
}
echo "<pre>$html</pre>";
The answer by @kinglish is the basis of this solution, thank you very much. I slightly modified and simplified it to fit the article linked in my question. This code worked for me:
preg_match_all('#(\<h[3-4])\sid=\"(.*?)\"?\>(.*?)(<\/h[3-4]>)#si',$content, $matches);
$tags = $matches[0];
$ids = $matches[2];
$raw_names = $matches[3];
/* Clean $raw_names of other HTML tags */
$clean_names= array_map(function($v){
return trim(strip_tags($v));
}, $raw_names);
$names = $clean_names;
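For headings that may or may not carry an id, one possible variation is to make the id part optional. This is only a sketch under assumptions: the sample $content is invented, and the pattern expects id to be the first attribute when it is present.

```php
<?php
// Hypothetical content for illustration.
$content = '<h3 id="intro">Intro</h3><p>x</p><h4 id="more"><a href="#">More <b>info</b></a></h4><h3>No id</h3>';

// Group 1: heading level, group 2: id (empty when absent), group 3: inner HTML.
// \1 makes the closing tag match the same level as the opening tag.
preg_match_all('#<h([34])(?:\s+id="([^"]*)")?[^>]*>(.*?)</h\1>#si', $content, $matches);

$ids   = $matches[2];
$names = array_map(function ($v) {
    // Drop nested tags such as <a> or <b> and keep only the text.
    return trim(strip_tags($v));
}, $matches[3]);

print_r($ids);
print_r($names);
```

Because strip_tags() runs on the inner HTML, it makes no difference whether the heading text is wrapped in a link.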
First of all, I found some threads here on SO, for example here, but it's not exactly what I am looking for.
Here is a sample of text that I have:
Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook
The desired output:
2012-12-13
Peter Novak
books,cinema,facebook
I need to save this information into our database, but I don't know how to detect the label between the <b> tags (e.g. Date) and then the value that immediately follows it (in this case: 2012-12-13)...
I would be grateful for any help with this, thank you!
Since there's not much DOM to traverse, there's not much a DOM traversal tool can do with this.
This should work:
1) Remove everything before the b tag.
2) Remove the b tags. A DOM traversal tool can do this, but if they are pure text, even a regex can do it, and it can remove the colon and the subsequent whitespace in the same pass: <b\s*>[^<]+</b\s*>:\s*
3) Change sequences of br tags to bare newlines (do you really want to?). The DOM traversal tool can do this, but so can regexes: (?:<br\s*/?>)+
$html = preg_replace('#^[^<]+#', "", $html);
$html = preg_replace('#<b\s*>[^<]+</b\s*>:\s*#', "", $html);
$html = preg_replace('#(?:<br\s*/?>)+#', "\n", $html);
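If the labels are needed as well as the values, a single preg_match_all() can pair them up. This is a sketch that assumes the "<b>Label</b>: value" layout from the question and values that contain no tags; the $fields name is just illustrative.

```php
<?php
$html = 'Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook';

// Group 1: label inside <b>...</b>; group 2: text up to the next tag or end of string.
preg_match_all('#<b\s*>([^<]+)</b\s*>:\s*([^<]*)#', $html, $m);

// Map each label to its trimmed value.
$fields = array_combine($m[1], array_map('trim', $m[2]));
print_r($fields);
```

The resulting array can then be used directly when building the database insert.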
If <b>Date</b>, <b>Name</b>, <b>Hobby</b> and the <br />'s will always be there in that way, I suggest you use strpos() and substr().
For instance, to get the date:
// Get start position, +13 because of "<b>Date</b>: "
$dateStartPos = strpos($yourText, "<b>Date</b>") + 13;
// Get end position, use dateStartPos as offset
$dateEndPos = strpos($yourText, "<br />", $dateStartPos);
// Cut out the date, the length is the end position minus the start position
$date = substr($yourText, $dateStartPos, ($dateEndPos - $dateStartPos));
Assuming that the format is consistent, then explode can work for you:
<?php
$text = "Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook";
$tokenized = explode(': ', $text);
$tokenized[1] = explode("<br", $tokenized[1]);
$tokenized[2] = explode("<br", $tokenized[2]);
$tokenized[3] = explode("<br", $tokenized[3]);
$date = $tokenized[1][0];
$name = $tokenized[2][0];
$hobby = $tokenized[3][0];
echo $date;
echo $name;
echo $hobby;
?>
Using PHP Simple HTML DOM Parser you can achieve this easily (just like jQuery)
include('simple_html_dom.php');
$html = str_get_html('Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook');
Or
$html = file_get_html('http://your_page.com/');
then
foreach($html->find('text') as $t){
    if(substr($t, 0, 1)==':')
    {
        // do whatever you want
        echo substr($t, 1).'<br />';
    }
}
The output of the example is given below
2012-12-13
Peter Novak
books,cinema,facebook
I need some help to tweak this regular expression:
$content = 'more test test Jeff this is a test';
$content = preg_replace("~<a .*?href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>~", "$1", $content);
This expression strips the HTML markup off a mailto link and returns just the email address (jeff@test.com).
It works fine except in the example I gave above: because an unlimited number of whitespace characters is allowed before the href in the pattern, when a website link precedes the mailto link, the regex scans forward until it finds the mailto: in the following link and removes all the content in between.
Maybe a fix would be to limit it to two or three whitespace characters after the opening tag so it doesn't look so far ahead, but I wonder if there is a better solution from people who know regex better than I do?
Here is what you should be using...
$dom = new DOMDocument;
$dom->loadHTML($content);
foreach($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')
        AND strpos($href = trim($a->getAttribute('href')), 'mailto:') === 0) {
        $textNode = $dom->createTextNode(substr($href, 7));
        $parent = $a->parentNode;
        $parent->insertBefore($textNode, $a);
        $parent->removeChild($a);
    }
}
$dom->saveHTML() adds all the HTML boilerplate, such as the html and body elements; you can remove them with...
$html = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
    $html .= $dom->saveHTML($node);
}
The problem is not that you allow any amount of whitespace; that would work. The problem is that your <a .* allows one space followed by any number of ANY characters.
If you fix this and really allow only whitespace, like this
<a\s+href=[\'|\"]mailto:(.*?)[\'|\"].*?>.*?</a>
it seems to work.
See it here at Regexr
But you should probably take a closer look at alex's answer (+1 for the example), as it is the cleaner solution.
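If other attributes may appear before href, a variation is to allow any characters except > inside the opening tag, so the match cannot run on into the next link. This is a sketch only; both links in the sample string are assumptions, since the original sample does not show them.

```php
<?php
// Hypothetical input: both links are assumptions for illustration.
$content = 'more <a href="http://example.com">test</a> test <a class="x" href="mailto:jeff@test.com">Jeff</a> this is a test';

// [^>]* cannot cross the end of the opening tag, so the match cannot start
// in one link and finish in another. (["\']) ... \1 requires matching quotes.
$pattern = '~<a\s[^>]*href=(["\'])mailto:([^"\'>]*)\1[^>]*>.*?</a>~is';

// $2 is the address after "mailto:".
$result = preg_replace($pattern, '$2', $content);
echo $result;
```

The website link is left alone and only the mailto link is replaced by its address.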
I have a string that has some hyperlinks inside. I want to match only a certain link among them with a regex. I can't know whether the href or the class comes first; it may vary.
This is an example string:
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
I want to select from the above string only the link that has the class nextpostslink.
So, the match in this example should return this -
»eee
This regex is the closest I could get -
/<a\s?(href=)?('|")(.*)('|") class=('|")nextpostslink('|")>.{1,6}<\/a>/
But it is selecting the links from the start of the string.
I think my problem is in the (.*) , but I can't figure out how to change this to select only the needed link.
I would appreciate your help.
It's much better to use a genuine HTML parser for this. Abandon all attempts to use regular expressions on HTML.
Use PHP's DOMDocument instead:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
foreach ($dom->getElementsByTagName('a') as $link) {
    $classes = explode(' ', $link->getAttribute('class'));
    if (in_array('nextpostslink', $classes)) {
        // $link has the class "nextpostslink"
    }
}
Not sure if that's what you're after, but anyway: it's a bad idea to parse HTML with regex. Use an XPath implementation to reach the desired elements. The following XPath expression would give you all the 'a' elements with class "nextpostslink":
//a[contains(@class,"nextpostslink")]
There is loads of XPath info around. Since you didn't mention your programming language, here is a quick XPath tutorial using Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html
Edit:
php + xpath + html: http://dev.juokaz.com/php/web-scraping-with-php-and-xpath
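In PHP the XPath approach could look roughly like this, using DOMDocument and DOMXPath. The markup is a made-up stand-in for the pagination HTML, and the query pads the class attribute with spaces so that only a whole class name matches (a plain contains() would also match something like "nextpostslink2").

```php
<?php
// Hypothetical markup: the nextpostslink anchor and its URL are assumptions.
$html = '<div class="wp-pagenavi">'
      . '<a href="http://stv.localhost/page/2" class="page">2</a>'
      . '<a class="nextpostslink" href="http://stv.localhost/page/2">next</a>'
      . '</div>';

$dom = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Pad with spaces so only a whole class token matches.
$query = '//a[contains(concat(" ", normalize-space(@class), " "), " nextpostslink ")]';

$found = [];
foreach ($xpath->query($query) as $a) {
    $found[] = [$a->getAttribute('href'), $a->textContent];
}
print_r($found);
```

The attribute order inside the tag does not matter here, which was the sticking point with the regex.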
This would work in php:
/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m
This is of course assuming that the class attribute always comes after the href attribute.
This is a code snippet:
$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;
$regexp = "/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m";
$matches = array();
if(preg_match($regexp, $html, $matches)) {
echo "URL: " . $matches[2] . "\n";
echo "Text: " . $matches[6] . "\n";
}
I would however suggest first matching the link and then getting the url so that the order of the attributes doesn't matter:
<?php
$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>
»eee<span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;
$regexp = "/(<a[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>)/m";
$matches = array();
if(preg_match($regexp, $html, $matches)) {
    $link = $matches[0];
    $text = $matches[4];
    $regexp = "/href=(\"|')([^'\"]*)(\"|')/";
    $matches = array();
    // Search only inside the matched link, not the whole document
    if(preg_match($regexp, $link, $matches)) {
        $url = $matches[2];
        echo "URL: $url\n";
        echo "Text: $text\n";
    }
}
You could of course extend the regexp to match either variant (class first vs. href first), but it would be very long and I don't think it would improve performance.
Just as a proof of concept I created a regexp that doesn't care about the order:
/<a[^>]+(href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')|class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')[^>]+href=(\"|')([^\"']*)('|\"))[^>]*>(.{1,6})<\/a>/m
The text will be in group 12 and the URL will be in either group 3 or group 10 depending on the order.
As the question asks for a regex, here is one: <a\s[^>]*class=["|']nextpostslink["|'][^>]*>(.*)<\/a>
The order of the attributes doesn't matter, and it accepts both single and double quotes.
Check the regex online: https://regex101.com/r/DX03KD/1/
I replaced the (.*) with [^'"]+ as follows:
<a\s*(href=)?('|")[^'"]+('|") class=('|")nextpostslink('|")>.{1,6}</a>
Note: I tried this with RegexBuddy, so I didn't need to escape the <>'s or /.
Say I have the following text
..(content).............
<A HREF="http://foo.com/content" >blah blah blah </A>
...(continue content)...
I want to delete the link, i.e. remove the tag while keeping the text in between. How do I do this with a regular expression (since the URLs will all be different)?
Many thanks
This will remove all tags:
preg_replace("/<.*?>/", "", $string);
This will remove just the <a> tags:
preg_replace("/<\\/?a(\\s+.*?>|>)/", "", $string);
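One caveat: the sample in the question uses uppercase <A HREF=...> and </A>, so a case-sensitive pattern will not touch it. Here is a sketch with the i flag added. The lookahead is my own addition to avoid matching tags that merely start with "a", such as <abbr>.

```php
<?php
$string = "..(content).............\n"
        . "<A HREF=\"http://foo.com/content\" >blah blah blah </A>\n"
        . "...(continue content)...";

// The i flag is needed because the sample uses <A ...> and </A>.
// (?=[\s>]) ensures the tag name is exactly "a", not e.g. <abbr>.
$result = preg_replace('~</?a(?=[\s>])[^>]*>~i', '', $string);
echo $result;
```

The link text, including its trailing space, stays where it was.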
Avoid regular expressions whenever you can, especially when processing xml. In this case you can use strip_tags() or simplexml, depending on your string.
<?php
//example to extract the innerText from all anchors in a string
include('simple_html_dom.php');
$html = str_get_html('<A HREF="http://foo.com/content" >blah blah blah </A><A HREF="http://foo.com/content" >blah blah blah </A>');
//print the text of each anchor
foreach($html->find('a') as $e) {
    // the property name is lowercase in simple_html_dom
    echo $e->innertext;
}
?>
See PHP Simple DOM Parser.
Not pretty but does the job:
$data = str_replace('</a>', '', $data);
$data = preg_replace('/<a[^>]+href[^>]+>/', '', $data);
strip_tags() can also be used.
Please see examples here.
$pattern = '/href="([^"]*)"/';
I use this to replace the anchors with a text string...
function replaceAnchorsWithText($data) {
    $regex = '/(<a\s*'; // Start of anchor tag
    $regex .= '(.*?)\s*'; // Any attributes or spaces that may or may not exist
    $regex .= 'href=[\'"]+?\s*(?P<link>\S+)\s*[\'"]+?'; // Grab the link
    $regex .= '\s*(.*?)\s*>\s*'; // Any attributes or spaces that may or may not exist before closing tag
    $regex .= '(?P<name>\S+)'; // Grab the name
    $regex .= '\s*<\/a>)/i'; // Any number of spaces between the closing anchor tag (case insensitive)
    if (is_array($data)) {
        // This is what will replace the link (modify to your liking)
        $data = "{$data['name']}({$data['link']})";
    }
    return preg_replace_callback($regex, array('self', 'replaceAnchorsWithText'), $data);
}
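Note that the array('self', ...) callback only resolves when the function is a method of a class. For a plain function, a closure-based variant could look like the sketch below. The function name anchorsToText and the simplified pattern are my own assumptions, and it only handles anchor text without nested tags.

```php
<?php
function anchorsToText(string $data): string
{
    // Group "link": href value; group "name": anchor text without nested tags.
    $regex = '~<a\s[^>]*href=(["\'])(?P<link>[^"\']*)\1[^>]*>\s*(?P<name>[^<]*?)\s*</a>~i';

    // The closure receives the match array and returns the replacement text.
    return preg_replace_callback($regex, function (array $m): string {
        return "{$m['name']}({$m['link']})";
    }, $data);
}

$out = anchorsToText('See <A HREF="http://foo.com/content" >blah blah blah </A> now');
echo $out;
```

Unlike the version above, this one also accepts anchor text containing spaces, because the name group is not limited to \S+.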
use str_replace