Preg_grep pattern to get something between specific things - php

The file contains:
<a href="site.com/" h="
<a href="site3.com/" h="
I want to echo all the URLs using a pattern with preg_grep or preg_match. I need a pattern that captures everything between href=" and the closing ".
Thanks!

Here's an example of how to use DOMDocument:
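// Sample markup, using the URLs from the question
$html = '
<a href="site.com/">link1</a>
<a href="site3.com/">link2</a>
';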
$dom = new DOMDocument();
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach($links as $link) {
echo $link->getAttribute('href');
}
See the DOMDocument documentation on php.net for another example.
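If you want the URLs in an array rather than echoed directly, the same approach can be wrapped in a small function. This is only a sketch: extractHrefs is a hypothetical helper name, and the sample markup is assumed from the URLs in the question. Because real-world HTML is often malformed, parse warnings are collected silently with libxml_use_internal_errors().

```php
<?php
// Hypothetical helper: returns every href attribute found on <a> elements in $html.
function extractHrefs(string $html): array
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        if ($link->hasAttribute('href')) {
            $hrefs[] = $link->getAttribute('href');
        }
    }
    return $hrefs;
}

// Sample input assumed for illustration, using the URLs from the question.
$html = '<a href="site.com/">link1</a> <a href="site3.com/">link2</a>';
echo implode("\n", extractHrefs($html)), "\n";
```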

Related

How can I get the text from a child node with PHP DOMDocument

I've been writing PHP code to get information from a site. So far I have been able to get the href attribute, but I can't find a way to get the text from the child node "span". Can someone help me?
HTML:
<a class="js-publication" href="publication/247931167">
<span class="publication-title">An approach for textual authoring</span>
</a>
This is how I am currently getting the href:
@$dom->loadHTMLFile($curPage);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$class_ = $element->getAttribute('class');
if (0 === strpos($class_, 'js-publication')) {
$href = $element->getAttribute('href');
if(0 === stripos($href,'publication/')){
echo $href; // link to the publication
echo "\n";
}
}
}
You can use DOMXPath:
$html = <<< LOL
<a class="js-publication" href="publication/247931167">
<span class="publication-title">An approach for textual authoring</span>
</a>
LOL;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//a[@class='js-publication']") as $element){
echo $element->getAttribute('href');
echo $element->textContent;
}
//publication/247931167
//An approach for textual authoring
Or, without the loop, if you just want one element:
echo $xpath->query("//a[@class='js-publication']/span")->item(0)->textContent;
echo $xpath->query("//a[@class='js-publication']")->item(0)->getAttribute('href');
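If the page contains several such anchors, a relative XPath query lets you pair each href with the text of its own span. A minimal sketch, assuming the markup from the question; the $titles array is just an illustrative name:

```php
<?php
$html = <<<HTML
<a class="js-publication" href="publication/247931167">
<span class="publication-title">An approach for textual authoring</span>
</a>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$titles = []; // href => title
foreach ($xpath->query("//a[@class='js-publication']") as $anchor) {
    // The leading "." makes the query relative to $anchor (passed as the context node).
    $span = $xpath->query(".//span[@class='publication-title']", $anchor)->item(0);
    if ($span !== null) {
        $titles[$anchor->getAttribute('href')] = trim($span->textContent);
    }
}
print_r($titles);
```

Note that @class='js-publication' is an exact string comparison, so it will not match an anchor that carries additional classes.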

How can I remove a link using regex in PHP?

I want to remove all links whose href attribute matches the domain vnexpress.net.
This is an example link:
<a href="http://vnexpress.net/whatever">whatever</a>
This is my code:
$contents = preg_replace('/<a\s*href=\"*vnexpress*\"\s(.*)>(.*)<\/a>/', '', $data->content);
Please help me! Thank you so much!
You've asked for a regular expression here, but it's not the right tool for parsing HTML. Use DOMDocument with XPath instead:
$doc = new DOMDocument;
$doc->loadHTML($html); // load the html
$xpath = new DOMXPath($doc);
$links = $xpath->query("//a[contains(@href, 'vnexpress.net')]");
foreach ($links as $link) {
$link->parentNode->removeChild($link);
}
echo $doc->saveHTML();
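If you would rather keep the link text and only strip the surrounding <a> tag, the same query can be used to "unwrap" each link. This is a sketch under assumptions: the sample markup below is made up for illustration. The XPath result is a static list, so removing nodes while iterating over it is safe.

```php
<?php
// Sample markup assumed for illustration.
$html = '<p><a href="http://vnexpress.net/whatever">whatever <b>sss</b></a> and '
      . '<a href="http://new.net/whatever">other</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

foreach ($xpath->query("//a[contains(@href, 'vnexpress.net')]") as $link) {
    // Move each child node in front of the link, then drop the now-empty <a>.
    while ($link->firstChild !== null) {
        $link->parentNode->insertBefore($link->firstChild, $link);
    }
    $link->parentNode->removeChild($link);
}

$result = $doc->saveHTML();
echo $result;
```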
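Try this (it strips the matching <a> tags but keeps the link text):
// The dot in the domain is escaped, and the lazy (.*?) stops at the first </a>.
$re = "/<a[^>]+href=\"[^\"]*vnexpress\\.net[^>]+>(.*?)<\\/a>/m";
$str = "<a id=\"\" href=\"http://vnexpress.net/whatever\">whatever <b>sss</b> </a>\n<a id=\"\" href=\"http://new.net/whatever\">whatever</a>\n";
$subst = '$1'; // replace each match with its captured link text
$result = preg_replace($re, $subst, $str);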

Modify an HTML attribute with PHP

I have an HTML string that contains exactly one <a> element. Example:
<a href="..." rel="external">test</a>
In PHP I have to test whether rel contains external and, if so, modify href and save the string.
I have looked at DOM nodes and objects, but they seem like too much for a single <a> element: I have to iterate to get the HTML nodes, and I am not sure how to test whether rel exists and contains external.
$html = new DOMDocument();
$html->loadHtml($txt);
$a = $html->getElementsByTagName('a');
$attr = $a->item(0)->attributes; // attributes is a property, not a method
...
At this point I get a DOMNamedNodeMap, which seems like overhead. Is there any simpler way to do this, or should I do it with DOM?
Is there any simpler way to do this, or should I do it with DOM?
Do it with DOM.
Here's an example:
<?php
$html = '<a href="http://example.com/" rel="external">test</a>'; // sample markup assumed for illustration
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]");
foreach($nodes as $node) {
$node->setAttribute('href', 'http://example.org');
}
echo $dom->saveHTML();
I went ahead and did the modification with DOM. This is what I ended up with:
$html = new DOMDocument();
$html->loadHtml('<?xml encoding="utf-8" ?>' . $txt);
$nodes = $html->getElementsByTagName('a');
foreach ($nodes as $node) {
foreach ($node->attributes as $att) {
if ($att->name == 'rel') {
if (strpos($att->value, 'external') !== false) { // strpos returns 0 for a match at the start
$node->setAttribute('href','modified_url_goes_here');
}
}
}
}
$txt = $html->saveHTML();
I did not want to load any other library for just this one string.
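One caveat with a substring test on rel: strpos() returns 0 when the match is at the very start of the string, which is falsy in an if, and it also matches unrelated tokens that merely contain the word. A token-based check avoids both problems. The helper name relHasToken below is hypothetical, not part of any library:

```php
<?php
// Hypothetical helper: true if the whitespace-separated rel value contains $token (case-insensitive).
function relHasToken(string $rel, string $token): bool
{
    $tokens = preg_split('/\s+/', strtolower(trim($rel)), -1, PREG_SPLIT_NO_EMPTY);
    return in_array(strtolower($token), $tokens, true);
}

var_dump(strpos('external nofollow', 'external'));      // int(0), which is falsy in an if
var_dump(relHasToken('external nofollow', 'external')); // bool(true)
var_dump(relHasToken('externally-hosted', 'external')); // bool(false)
```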
The best way is to use an HTML parser/DOM, but here's a regex solution:
$html = 'test<br>
<p> Some text</p>
test2<br>
<a rel="external">test3</a> <-- This won\'t work since there is no href in it.
';
$new = preg_replace_callback('/<a.+?rel\s*=\s*"([^"]*)"[^>]*>/i', function($m){
if(strpos($m[1], 'external') !== false){
$m[0] = preg_replace('/href\s*=\s*(("[^"]*")|(\'[^\']*\'))/i', 'href="http://example.com"', $m[0]);
}
return $m[0];
}, $html);
echo $new;
You could use a regular expression: if the tag matches
/\s+rel\s*=\s*"[^"]*external[^"]*"/
then do a regex replace like
/(<a[^>]*href\s*=\s*")([^"]*)("[^>]*>)/\1[your new href here]\3/
That said, using a library that can do this kind of work for you is much easier (like jQuery for JavaScript).

How can I identify rel=NOFOLLOW links

I would like to know how to identify a nofollow relation on a link using a PHP regex.
<a href="abc.html" rel="NOFOLLOW">How to check NOFOLLOW</a>
Please suggest a way to detect this.
You could try with something such as...
preg_match('/<a.+?rel="nofollow".*?>[\s\S]*?<\/a>/i', $html);
But you are better off using an HTML parser, which handles cases that a regex cannot.
$dom = new DOMDocument;
$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $anchor) {
if ($anchor->hasAttribute('rel')) {
$rel = preg_split('/\s+/', strtolower($anchor->getAttribute('rel')));
if (in_array('nofollow', $rel)) {
echo 'This anchor is "nofollow"\'d.';
}
}
}
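The same token check can also be pushed into a single XPath query. The sketch below assumes sample markup based on the link in the question (the second anchor is made up for contrast). XPath 1.0 has no lower-case function, so translate() is used to make the comparison case-insensitive.

```php
<?php
// Sample markup assumed for illustration, based on the link in the question.
$html = '<a href="abc.html" rel="NOFOLLOW">How to check NOFOLLOW</a> '
      . '<a href="def.html" rel="author">Other</a>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// translate() maps A-Z to a-z; the padded contains() test matches whole tokens only.
$lowerRel = "translate(@rel, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')";
$query = "//a[contains(concat(' ', normalize-space($lowerRel), ' '), ' nofollow ')]";

$nofollow = [];
foreach ($xpath->query($query) as $anchor) {
    $nofollow[] = $anchor->getAttribute('href');
}
print_r($nofollow);
```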

How can I match the following pattern

I need to match the pattern
<a class="item-link" href="NEED TO GET THIS PART">AND THIS PART</a>
I tried these three regex patterns, but none of them works.
preg_match_all("/<a.*(?:[^class=\"item-link\"=]*)class=\"item-link\"(?:[^href=]*)href=(?:'|\")?(.*)(?:'|\")(?:[^>]*)>(.*)<\/a>/", $content, $tablecontent);
preg_match_all("|/<a (?:[^href=]*)href=(?:'|\")?(.*)(?:'|\")(?:[^>]*)>(.*)<\/a>/|s", $content, $tablecontent);
preg_match_all("|/<a.+class=\"item-link\".+href=\"(.*)\"[^>]*>\.+<\/a[^>]*>/|m", $content, $tablecontent);
print_r($tablecontent);
Try this:
preg_match('/<a class="item-link" href="([^"]+)">([^<]+)<\/a>/', $content, $matches);
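Since the question uses preg_match_all, here is a sketch of the same pattern applied to several links at once. The sample $content and its href values are assumptions made for illustration. PREG_SET_ORDER groups the captures per match, so each entry holds one link's href and text.

```php
<?php
// Sample content assumed for illustration.
$content = '<a class="item-link" href="/first">First</a> '
         . '<a class="item-link" href="/second">Second</a>';

preg_match_all(
    '/<a class="item-link" href="([^"]+)">([^<]+)<\/a>/',
    $content,
    $matches,
    PREG_SET_ORDER
);

foreach ($matches as $m) {
    // $m[1] is the href, $m[2] is the link text.
    echo $m[1], ' => ', $m[2], "\n";
}
```

This only works while the attributes appear in exactly this order and the link text contains no nested tags.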
This is the proper way to do this:
$html = '<a class="item-link" href="NEED TO GET THIS PART">AND THIS PART</a>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom); // the class is DOMXPath, not XPath
$results = $xp->query('//a[@class="item-link"]'); // @ selects the attribute
foreach ($results as $link) {
$href = $link->getAttribute('href');
$text = $link->nodeValue;
// ... do your stuff here ...
}
Overkill for a single link, but by far the easiest way when dealing with a full HTML page.
