How can I remove links using regex in PHP? - php

I want to remove all links whose href attribute contains the domain vnexpress.net.
This is a link example:
whatever
This is my code:
$contents = preg_replace('/<a\s*href=\"*vnexpress*\"\s(.*)>(.*)<\/a>/', '', $data->content);
Please help me! Thank you so much!

You've asked for a regular expression here, but it's not the right tool for parsing HTML.
$doc = new DOMDocument;
$doc->loadHTML($html); // load the html
$xpath = new DOMXPath($doc);
$links = $xpath->query("//a[contains(@href, 'vnexpress.net')]");
foreach ($links as $link) {
    $link->parentNode->removeChild($link);
}
echo $doc->saveHTML();
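The snippet above deletes each matching link together with its text. If the goal is to drop only the `<a>` wrapper and keep the link text (which is what the regex answer below does with `$1`), a minimal sketch could look like this. The sample `$html` is hypothetical and stands in for `$data->content` from the question.

```php
<?php
// Hypothetical sample input standing in for $data->content.
$html = '<p>Read <a href="http://vnexpress.net/whatever">whatever <b>sss</b></a> or '
      . '<a href="http://new.net/whatever">this</a>.</p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Take a snapshot of the matches before changing the tree.
$links = iterator_to_array($xpath->query("//a[contains(@href, 'vnexpress.net')]"));

foreach ($links as $link) {
    // Move every child of the link in front of it, then drop the empty <a>.
    while ($link->firstChild) {
        $link->parentNode->insertBefore($link->firstChild, $link);
    }
    $link->parentNode->removeChild($link);
}

$result = $doc->saveHTML();
echo $result;
```

Note that `contains(@href, 'vnexpress.net')` is a plain substring test, so it would also match a URL that merely mentions the domain in its path or query string.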

Try this:
$re = "/<a[^>]+href=\"[^\"]*vnexpress\\.net[^>]+>(.*?)<\\/a>/m";
$str = "<a id=\"\" href=\"http://vnexpress.net/whatever\">whatever <b>sss</b> </a>\n<a id=\"\" href=\"http://new.net/whatever\">whatever</a>\n";
$subst = "$1";
$result = preg_replace($re, $subst, $str);

Related

Preg_grep pattern to get something between specific things

file contains :
<a href="site.com/" h="
<a href="site3.com/" h="
So I want to echo all URLs using a pattern with preg_grep or preg_match:
a pattern that captures everything between href=" and "
Thanks!
Here's an example how to use DOMDocument
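// sample markup; the hrefs are borrowed from the question
$html = '
<a href="site.com/">link1</a>
<a href="site3.com/">link2</a>
';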
$dom = new DOMDocument();
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach($links as $link) {
echo $link->getAttribute('href');
}
Look at another example at php.net
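Because the lines in the question are truncated tags rather than complete HTML, a pattern-based approach as originally requested may be the more direct fit here. A small sketch, assuming every URL sits between `href="` and the next double quote; the variable names are illustrative only.

```php
<?php
// Sample lines taken from the question; the tags are truncated on purpose.
$text = "<a href=\"site.com/\" h=\"\n<a href=\"site3.com/\" h=\"\n";

// Capture everything between href=" and the next double quote.
preg_match_all('/href="([^"]*)"/', $text, $matches);

$urls = $matches[1];   // group 1 holds only the captured URLs
print_r($urls);
```

`preg_match_all` is used rather than `preg_match` because the latter stops after the first match.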

Replace the last occurrence of <p> tag in a string

I'm looking to replace the last occurrence of P tag in a string.
$bodytext = preg_replace(strrev("/<p>/"),strrev('<p class="last">'),strrev($bodytext),1);
$bodytext = strrev($bodytext);
This works, but can it be done without using strrev? Is there a regex solution?
Something like :
$bodytext = preg_replace('/<p>.?$/', '<p class="last">', $bodytext);
Any help would be greatly appreciated.
My shortened version:
$dom = new DOMDocument();
$dom->loadHTML($bodytext);
$paragraphs = $dom->getElementsByTagName('p');
$last_p = $paragraphs->item($paragraphs->length - 1);
$last_p->setAttribute("class", "last");
$bodytext = $dom->saveHTML();
Some people will complain that DOMDocument is more verbose for parsing HTML than a regex. But verbosity is okay if it means using the right tool for the job.
$previous_value = libxml_use_internal_errors(TRUE);
$string = '<p>hi, mom</p><p>bye, mom</p>';
$dom = new DOMDocument();
$dom->loadHTML($string);
$paragraphs = $dom->getElementsByTagName('p');
$last_p = $paragraphs->item($paragraphs->length - 1);
$last_p->setAttribute("class", "last");
$new_string = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML()));
libxml_clear_errors();
libxml_use_internal_errors($previous_value);
echo htmlentities($new_string);
// <p>hi, mom</p><p class="last">bye, mom</p>
How about using simple html dom?
require_once('simple_html_dom.php');
$string = '<p>hi, mom</p><p>bye, mom</p>';
$doc = str_get_html($string);
$doc->find('p', -1)->class = 'last';
echo $doc;
// <p>hi, mom</p><p class="last">bye, mom</p>
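For completeness, the question asked whether this can be done without strrev. Two possible sketches follow, both assuming the tag appears literally as `<p>` with no attributes, as in the question: a regex with a negative lookahead, and a plain string search with `strrpos`.

```php
<?php
$bodytext = '<p>hi, mom</p><p>bye, mom</p>';

// Regex variant: match a <p> that has no other <p> anywhere after it.
// The s modifier lets . cross newlines so the lookahead scans to the end.
$viaRegex = preg_replace('/<p>(?!.*<p>)/s', '<p class="last">', $bodytext);

// String variant: find the last <p> and splice in the replacement.
$pos = strrpos($bodytext, '<p>');
$viaStrrpos = $pos === false
    ? $bodytext
    : substr_replace($bodytext, '<p class="last">', $pos, strlen('<p>'));

echo $viaRegex, "\n", $viaStrrpos, "\n";
```

Neither variant understands HTML structure, so a `<p>` inside a comment or a script block would be treated like any other.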

PHP grabbing content between two strings

// get CONTENT from united domains footer
$content = file_get_contents('http://www.uniteddomains.com/index/footer/');
// remove spaces from CONTENT
$content = preg_replace('/\s+/', '', $content);
// match all tld tags
$regex = '#target="_parent">.(.*?)</a></li><li>#';
preg_match($regex, $source, $matches);
print_r($matches);
I want to match all of the TLDs.
Each TLD is preceded by target="_parent">. and followed by </a></li><li>
I want to end up with an array like array('africa', 'amsterdam', 'bcn', etc.)
What am I doing wrong here?
NOTE: The second step to remove all the spaces is just to simplify things.
Here's a regular expression that will do it for that page.
\.\w+(?=</a></li>)
PHP
$content = file_get_contents('http://www.uniteddomains.com/index/footer/');
preg_match_all('/\.\w+(?=<\/a><\/li>)/m', $content, $matches);
print_r($matches);
Here are the results:
.africa, .amsterdam, .bcn, .berlin, .boston, .brussels, .budapest, .gent, .hamburg, .koeln, .london, .madrid, .melbourne, .moscow, .miami, .nagoya, .nyc, .okinawa, .osaka, .paris, .quebec, .roma, .ryukyu, .stockholm, .sydney, .tokyo, .vegas, .wien, .yokohama, .africa, .arab, .bayern, .bzh, .cymru, .kiwi, .lat, .scot, .vlaanderen, .wales, .app, .blog, .chat, .cloud, .digital, .email, .mobile, .online, .site, .mls, .secure, .web, .wiki, .associates, .business, .car, .careers, .contractors, .clothing, .design, .equipment, .estate, .gallery, .graphics, .hotel, .immo, .investments, .law, .management, .media, .money, .solutions, .sucks, .taxi, .trade, .archi, .adult, .bio, .center, .city, .club, .cool, .date, .earth, .energy, .family, .free, .green, .live, .lol, .love, .med, .ngo, .news, .phone, .pictures, .radio, .reviews, .rip, .team, .technology, .today, .voting, .buy, .deal, .luxe, .sale, .shop, .shopping, .store, .eus, .gay, .eco, .hiv, .irish, .one, .pics, .porn, .sex, .singles, .vin, .vip, .bar, .pizza, .wine, .bike, .book, .holiday, .horse, .film, .music, .party, .email, .pets, .play, .rocks, .rugby, .ski, .sport, .surf, .tour, .video
Using the DOM is cleaner:
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.uniteddomains.com/index/footer/'); // @ suppresses warnings from malformed markup
$xpath = new DOMXPath($doc);
$items = $xpath->query('/html/body/div/ul/li/ul/li[not(@class)]/a[@target="_parent"]/text()');
$result = '';
foreach ($items as $item) {
    $result .= $item->nodeValue;
}
$result = explode('.', $result);
array_shift($result);
print_r($result);
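As for what went wrong in the question's own code: the pattern is run against `$source`, which is never defined (the page was stored in `$content`), `preg_match` stops at the first match, the dot after `>` is unescaped, and the trailing `<li>` would exclude the last item of a list. A corrected sketch, using a hypothetical three-item excerpt in place of the live page:

```php
<?php
// Hypothetical excerpt standing in for the downloaded footer markup.
$content = '<ul><li><a href="/africa" target="_parent">.africa</a></li>'
         . '<li><a href="/amsterdam" target="_parent">.amsterdam</a></li>'
         . '<li><a href="/bcn" target="_parent">.bcn</a></li></ul>';

// Search $content (not the undefined $source) and collect every match.
// \. matches the literal dot, so group 1 holds the name without it.
preg_match_all('#target="_parent">\.(.*?)</a></li>#', $content, $matches);

$tlds = $matches[1];
print_r($tlds);
```

The real page may be structured differently from this excerpt, so the pattern should be checked against the actual markup.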

How can I identify rel=NOFOLLOW links

I would like to know how to identify links with the nofollow relation using a PHP regex.
<a href="abc.html" rel="NOFOLLOW">How to check NOFOLLOW</a>
Please show me how to detect these.
You could try with something such as...
preg_match('/<a.+?rel="nofollow".*?>[\s\S]*?<\/a>/i', $html);
But you are better off using an HTML parser, which deals with things that a regex cannot.
$dom = new DOMDocument;
$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $anchor) {
if ($anchor->hasAttribute('rel')) {
$rel = preg_split('/\s+/', strtolower($anchor->getAttribute('rel')));
if (in_array('nofollow', $rel)) {
echo 'This anchor is "nofollow"\'d.';
}
}
}
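To make the difference concrete, here is the parser approach wrapped in a small function and run on a few sample anchors. The function name `findNofollowLinks` and the sample markup are hypothetical. The second anchor carries two rel values, a case that the `rel="nofollow"` regex above would miss.

```php
<?php
// Hypothetical helper: returns the href of every anchor whose rel list contains "nofollow".
function findNofollowLinks(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $found = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // rel is a space-separated list; compare tokens case-insensitively.
        $rel = preg_split('/\s+/', strtolower(trim($anchor->getAttribute('rel'))));
        if (in_array('nofollow', $rel, true)) {
            $found[] = $anchor->getAttribute('href');
        }
    }
    return $found;
}

$html = '<a href="abc.html" rel="NOFOLLOW">one</a>'
      . '<a href="def.html" rel="external NoFollow">two</a>'
      . '<a href="ghi.html">three</a>';

$links = findNofollowLinks($html);
print_r($links);
```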

How can I match the following pattern

I need to match the pattern
<a class="item-link" href="NEED TO GET THIS PART">AND THIS PART</a>
I tried all three regex patterns but none seem to help me.
preg_match_all("/<a.*(?:[^class=\"item-link\"=]*)class=\"item-link\"(?:[^href=]*)href=(?:'|\")?(.*)(?:'|\")(?:[^>]*)>(.*)<\/a>/", $content, $tablecontent);
preg_match_all("|/<a (?:[^href=]*)href=(?:'|\")?(.*)(?:'|\")(?:[^>]*)>(.*)<\/a>/|s", $content, $tablecontent);
preg_match_all("|/<a.+class=\"item-link\".+href=\"(.*)\"[^>]*>\.+<\/a[^>]*>/|m", $content, $tablecontent);
print_r($tablecontent);
Try this:
preg_match('/<a class="item-link" href="([^"]+)">([^<]+)<\/a>/', $content, $matches);
This is the proper way to do this:
$html = '<a class="item-link" href="NEED TO GET THIS PART">AND THIS PART</a>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$results = $xp->query('//a[@class="item-link"]');
foreach ($results as $link) {
    $href = $link->getAttribute('href');
    $text = $link->nodeValue;
    // ... do your stuff here ...
}
Overkill for a single link, but by far the easiest way when dealing with a full HTML page.
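One caveat with an exact attribute test such as `@class="item-link"`: it only matches when the attribute value is exactly that string, so a link with several classes would be skipped. A common XPath 1.0 workaround is to pad the class list with spaces and look for the padded token. The sketch below uses hypothetical sample markup and collects href and text pairs.

```php
<?php
// Hypothetical sample markup; the first link carries two classes.
$html = '<a class="item-link featured" href="/one">First</a>'
      . '<a class="item-link" href="/two">Second</a>'
      . '<a class="other" href="/three">Third</a>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
// Pad the class list with spaces so "item-link" matches only as a whole token.
$query = '//a[contains(concat(" ", normalize-space(@class), " "), " item-link ")]';

$pairs = [];
foreach ($xp->query($query) as $link) {
    $pairs[$link->getAttribute('href')] = $link->nodeValue;
}
print_r($pairs);
```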
