How can I match the following pattern in PHP?

I need to match the pattern
<a class="item-link" href="NEED TO GET THIS PART">AND THIS PART</a>
I tried all three regex patterns but none seem to help me.
preg_match_all("/<a.*(?:[^class=\"item-link\"=]*)class=\"item-link\"(?:[^href=]*)href=(?:'|\")?(.*)(?:'|\")(?:[^>]*)>(.*)<\/a>/", $content, $tablecontent);
preg_match_all("|/<a (?:[^href=]*)href=(?:'|\")?(.*)(?:'|\")(?:[^>]*)>(.*)<\/a>/|s", $content, $tablecontent);
preg_match_all("|/<a.+class=\"item-link\".+href=\"(.*)\"[^>]*>\.+<\/a[^>]*>/|m", $content, $tablecontent);
print_r($tablecontent);

Try this:
preg_match('/<a class="item-link" href="([^"]+)">([^<]+)<\/a>/', $content, $matches);

This is the proper way to do this:
$html = '<a class="item-link" href="NEED TO GET THIS PART">AND THIS PART</a>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom); // the class is DOMXPath, not XPath
$results = $xp->query('//a[@class="item-link"]'); // attributes need the @ prefix
foreach ($results as $link) {
    $href = $link->getAttribute('href');
    $text = $link->nodeValue;
    // ... do your stuff here ...
}
Overkill for a single link, but by far the easiest way when dealing with a full HTML page.
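To illustrate the full-page case, here is a minimal sketch. The sample markup and its URLs are made up for the example, and the libxml_use_internal_errors() call is my addition so that sloppy real-world HTML does not flood the output with warnings.

```php
<?php
// Hypothetical sample page; the markup and URLs are made up for illustration.
$html = <<<'HTML'
<html><body>
  <a class="item-link" href="/first">First item</a>
  <a class="other" href="/skip">Skipped</a>
  <a href="/second" class="item-link">Second item</a>
</body></html>
HTML;

// Collect parse warnings instead of printing them; real pages are rarely valid.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
$items = [];
foreach ($xp->query('//a[@class="item-link"]') as $link) {
    // Attribute order in the markup does not matter to the query.
    $items[] = [
        'href' => $link->getAttribute('href'),
        'text' => trim($link->textContent),
    ];
}
print_r($items);
```

Note that @class="item-link" only matches when the class attribute is exactly that string. If the anchors can carry several classes, something like contains(concat(' ', normalize-space(@class), ' '), ' item-link ') is probably closer to what you want.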

Related

Preg_grep pattern to get something between specific things

The file contains:
<a href="site.com/" h="
<a href="site3.com/" h="
I want to echo all URLs using a pattern with preg_grep or preg_match,
i.e. a pattern that captures everything between href=" and the next ".
Thanks!
Here's an example of how to use DOMDocument:
$html = '
link1
link2
';
$dom = new DOMDocument();
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    echo $link->getAttribute('href');
}
Look at another example at php.net
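For reference, a self-contained variant of the same idea might look like the sketch below. The sample markup is an assumption on my part (the hosts are borrowed from the question, the rest is made up), and the URLs are collected into an array rather than echoed directly.

```php
<?php
// Made-up sample markup; only the host names come from the question.
$html = '<p><a href="http://site.com/">link1</a> <a href="http://site3.com/">link2</a></p>';

// Keep libxml quiet about imperfect markup.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$urls = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    // Skip anchors that have no href at all (named anchors, for example).
    if ($link->hasAttribute('href')) {
        $urls[] = $link->getAttribute('href');
    }
}
echo implode("\n", $urls), "\n";
```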

How can i remove link using regex in php?

I want to remove all links whose href attribute contains the domain vnexpress.net.
This is a link example:
whatever
This is my code:
$contents = preg_replace('/<a\s*href=\"*vnexpress*\"\s(.*)>(.*)<\/a>/', '', $data->content);
Please help me! Thank you so much!
You've asked for a regular expression here, but it's not the right tool for parsing HTML.
$doc = new DOMDocument;
$doc->loadHTML($html); // load the html
$xpath = new DOMXPath($doc);
$links = $xpath->query("//a[contains(@href, 'vnexpress.net')]"); // attributes are addressed with @, not #
foreach ($links as $link) {
    $link->parentNode->removeChild($link);
}
echo $doc->saveHTML();
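If you would rather keep the link text and only drop the surrounding anchor, one possible DOM-based sketch is to move the anchor's children in front of it before removing it. The sample markup here is made up, and serialising only the children of body is just one way to avoid the doctype and html wrapper that loadHTML() adds.

```php
<?php
// Made-up sample input for illustration.
$html = '<p>Read <a href="http://vnexpress.net/whatever">whatever <b>sss</b></a>'
      . ' and <a href="http://new.net/x">this</a>.</p>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query("//a[contains(@href, 'vnexpress.net')]") as $link) {
    // Move each child in front of the anchor, then drop the now-empty anchor.
    while ($link->firstChild !== null) {
        $link->parentNode->insertBefore($link->firstChild, $link);
    }
    $link->parentNode->removeChild($link);
}

// Serialise only the contents of <body>.
$result = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```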
Try this:
$re = "/<a[^>]+href=\"[^\"]*vnexpress\\.net[^>]+>(.*?)<\\/a>/m";
$str = "<a id=\"\" href=\"http://vnexpress.net/whatever\">whatever <b>sss</b> </a>\n<a id=\"\" href=\"http://new.net/whatever\">whatever</a>\n";
$subst = "$1";
$result = preg_replace($re, $subst, $str);

Stripping <ins> and <del> tags from <script> tags

I have some code that is generating a diff between two documents, inserting <ins> and <del> tags haphazardly. For the most part it's doing a great job, but every now and then it inserts tags in script, style and the title tags.
Any ideas on how to remove the <del> tags (including the text between them), remove the <ins> tags (but retaining the text within them as part of the original string), however only within those three tags? (title, script and style).
Don't use regex to do this; it sounds like you have to deal with many, many lines. DOMDocument is great.
$dom = new DOMDocument;
$dom->loadHTML($your_html_string);
$xpath = new DOMXPath($dom);
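foreach ($xpath->query('//script|//title|//style') as $node) {
    // getElementsByTagName() returns a live list, so copy it before modifying the tree
    foreach (iterator_to_array($node->getElementsByTagName('del')) as $delNode) {
        $delNode->parentNode->removeChild($delNode);
    }
    foreach (iterator_to_array($node->getElementsByTagName('ins')) as $insNode) {
        $insNode->parentNode->replaceChild($dom->createTextNode($insNode->nodeValue), $insNode);
    }
}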
Untested, this may or may not work:
$str = preg_replace('~(<script.*?>.*?)<del>.*?</del>(.*?</script>)~is', '$1$2', $str);
It attempts to look within the <script> ... </script> block of the string, and replace any instances of <del>...</del> with empty string.
The following ended up working quite well for me:
$tags = array('script', 'title', 'style');
foreach ($tags as $tag) {
    $str = preg_replace_callback(
        '/(<' . ($tag) . '\b[^>]*>)(.*?)(<\/' . ($tag) . '>)/is',
        function($match) {
            $replaced = preg_replace(
                array(
                    '/__Delete-Start__.+__Delete-End__/Uis',
                    '/__Insert-Start__(.+)__Insert-End__/Uis'
                ),
                array(
                    '',
                    '$1'
                ),
                $match[2]
            );
            return ($match[1]) . ($replaced) . ($match[3]);
        },
        $str
    );
}
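The callback above works on placeholder markers. For anyone whose input contains literal tags, a rough adaptation could look like this. It assumes the diff emits plain <del> and <ins> tags without attributes, and stripDiffTags is a name I made up for the sketch.

```php
<?php
// Hypothetical helper; assumes literal <del>/<ins> tags without attributes.
function stripDiffTags(string $str): string
{
    foreach (['script', 'title', 'style'] as $tag) {
        $str = preg_replace_callback(
            // Capture the opening tag, the body, and the closing tag separately.
            '/(<' . $tag . '\b[^>]*>)(.*?)(<\/' . $tag . '>)/is',
            function (array $match): string {
                $inner = preg_replace(
                    // Drop <del> blocks entirely; unwrap <ins> blocks.
                    ['/<del>.*?<\/del>/is', '/<ins>(.*?)<\/ins>/is'],
                    ['', '$1'],
                    $match[2]
                );
                return $match[1] . $inner . $match[3];
            },
            $str
        );
    }
    return $str;
}

$input = '<title>Old<del> draft</del><ins> final</ins></title><p><del>kept</del></p>';
echo stripDiffTags($input), "\n";
```

The <del> inside the paragraph is left alone because only the bodies of script, title and style are passed through the inner replacement.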
While the following didn't end up being my solution, it did get me far and could be useful to others:
$dom = new DOMDocument;
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
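foreach ($xpath->query('//script|//title|//style') as $node) {
    // getElementsByTagName() returns a live list, so copy it before modifying the tree
    foreach (iterator_to_array($node->getElementsByTagName('del')) as $delNode) {
        $delNode->parentNode->removeChild($delNode);
    }
    foreach (iterator_to_array($node->getElementsByTagName('ins')) as $insNode) {
        $insNode->parentNode->replaceChild($dom->createTextNode($insNode->nodeValue), $insNode);
    }
}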
$str = (string) $dom->saveXML($dom, LIBXML_NOEMPTYTAG);//$xpath->query('//p')->item(0));
Hope this helps someone else.

How can I identify rel="nofollow" links?

I would like to know how to identify the nofollow relation on a link using a PHP regex.
<a href="abc.html" rel="NOFOLLOW">How to check NOFOLLOW</a>
Please suggest a way to detect this.
You could try with something such as...
preg_match('/<a.+?rel="nofollow".*?>[\s\S]*?<\/a>/i', $html);
But you are better off using an HTML parser, which deals with things that a regex cannot.
$dom = new DOMDocument;
$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $anchor) {
    if ($anchor->hasAttribute('rel')) {
        $rel = preg_split('/\s+/', strtolower($anchor->getAttribute('rel')));
        if (in_array('nofollow', $rel)) {
            echo 'This anchor is "nofollow"\'d.';
        }
    }
}
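The parser approach can also be wrapped in a small function that returns the matching URLs. This is only a sketch: findNofollowLinks is a hypothetical name, and the sample links other than abc.html are made up.

```php
<?php
// Hypothetical helper: returns the href of every anchor whose rel list contains "nofollow".
function findNofollowLinks(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $found = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // rel is a space-separated token list and is case-insensitive.
        $rel = preg_split('/\s+/', strtolower(trim($anchor->getAttribute('rel'))));
        if (in_array('nofollow', $rel, true)) {
            $found[] = $anchor->getAttribute('href');
        }
    }
    return $found;
}

$html = '<a href="abc.html" rel="NOFOLLOW">one</a>'
      . '<a href="def.html" rel="noopener nofollow">two</a>'
      . '<a href="ghi.html">three</a>';
print_r(findNofollowLinks($html));
```

Splitting rel into tokens matters because a value such as "noopener nofollow" would be missed by a regex that looks for rel="nofollow" exactly.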

php regular expression for matching anchor tags

Go to the source of this page: www.songs.pk/indian/7days.html
There are only eight links that start with http://link1,
for example: Tune Mera Naam Liya
I want a PHP regular expression that matches the URL
http://link1.songs.pk/song1.php?songid=2792
and the link text
Tune Mera Naam Liya
Thanks.
You're better off using something like simplehtmldom to find all links, then find all links with the relevant HTML / href.
Parsing HTML with regex isn't always the best solution, and in your case I feel it will bring you only pain.
$href = 'some_href';
$inner_text = 'some text';
$desired_anchors = array();
// file_get_html() comes from the simplehtmldom library, which must be included first
$html = file_get_html('your_file_or_url');
// Find all anchors; find() returns an array of element objects
foreach ($html->find('a') as $anchor) {
    // Compare with ==; the original used = on an undefined $a
    if ($anchor->href == $href && $anchor->innertext == $inner_text) {
        $desired_anchors[] = $anchor;
    }
}
print_r($desired_anchors);
That should get you started.
Don't use a regex, buddy! PHP has a better-suited tool for this...
$dom = new DOMDocument;
$dom->loadHTML($str);
$matchedAnchors = array();
$anchors = $dom->getElementsByTagName('a');
$match = 'http://link1';
foreach ($anchors as $anchor) {
    if ($anchor->hasAttribute('href') AND substr($anchor->getAttribute('href'), 0, strlen($match)) == $match) {
        $matchedAnchors[] = $anchor;
    }
}
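Since the question asks for both the URL and the link text, a self-contained sketch that collects the two might look like this. The input string is an assumption: it reuses the URL and title from the question, but the actual page markup is not shown there, and the second link is made up.

```php
<?php
// Assumed sample markup built from the URL and title quoted in the question.
$str = '<a href="http://link1.songs.pk/song1.php?songid=2792">Tune Mera Naam Liya</a>'
     . '<a href="http://example.com/other">Other</a>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($str);
libxml_clear_errors();

$prefix = 'http://link1';
$songs = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // Keep only links whose href starts with the prefix.
    if (strncmp($href, $prefix, strlen($prefix)) === 0) {
        $songs[$href] = trim($anchor->textContent);
    }
}
print_r($songs);
```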
Here you go:
preg_match_all('~<a .*href="(http://link1\..*)".*>(.*)</a>~Ui',$str,$match,PREG_SET_ORDER);
print_r($match);
