How can I identify rel="nofollow" links in PHP?

I would like to know how to identify links with a nofollow rel attribute using a PHP regex.
<a href="abc.html" rel="NOFOLLOW">How to check NOFOLLOW</a>
Please suggest a way to detect these.

You could try with something such as...
preg_match('/<a.+?rel="nofollow".*?>[\s\S]*?<\/a>/i', $html);
But you are better off using an HTML parser, which handles cases that a regex cannot.
$dom = new DOMDocument;
$dom->loadHTML($html);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $anchor) {
    if ($anchor->hasAttribute('rel')) {
        $rel = preg_split('/\s+/', strtolower($anchor->getAttribute('rel')));
        if (in_array('nofollow', $rel)) {
            echo 'This anchor is "nofollow"\'d.';
        }
    }
}
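The parser approach can also be wrapped in a small reusable function. This is only a sketch: the name findNofollowLinks and the sample markup are made up for illustration, and it assumes you want the href of every matching anchor.

```php
<?php
// Hypothetical helper (not from the answer above): collect the href of every
// anchor whose rel attribute contains the token "nofollow".
function findNofollowLinks(string $html): array
{
    // Collect libxml warnings internally instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // rel is a space-separated list of tokens, compared case-insensitively.
        $tokens = preg_split('/\s+/', strtolower(trim($anchor->getAttribute('rel'))));
        if (in_array('nofollow', $tokens, true)) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}

// Illustrative input only.
$sample = '<p><a href="abc.html" rel="NOFOLLOW">one</a> '
        . '<a href="def.html" rel="external noFollow">two</a> '
        . '<a href="ghi.html">three</a></p>';
print_r(findNofollowLinks($sample));
```

Splitting rel into tokens means values such as rel="external nofollow" are handled too, which the regex above would miss.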

Related

How to extract a specific type of link from a website using PHP?

I am trying to extract a specific type of link from a webpage using PHP.
The links look like this:
http://www.example.com/pages/12345667/some-texts-available-here
I want to extract all links in the above format:
maindomain.com/pages/somenumbers/sometexts
So far I can extract all the links from the webpage, but I cannot get the filter to work. How can I achieve this?
Any suggestions?
<?php
$html = file_get_contents('http://www.example.com');
// Create a new DOM document
$dom = new DOMDocument;
// Load the HTML; @ suppresses warnings caused by malformed markup
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
// Iterate over the extracted links and display their text and URLs
foreach ($links as $link) {
    // Extract and show the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}
?>
You can use DOMXPath and register a function with DOMXPath::registerPhpFunctions, then call it from an XPath query:
function checkURL($url) {
    $parts = parse_url($url);
    unset($parts['scheme']);
    // Accept only URLs made of a host plus a path like /pages/<digits>/<text>
    if ( count($parts) == 2 &&
         isset($parts['host']) &&
         isset($parts['path']) &&
         preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) ) {
        return true;
    }
    return false;
}
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($filename);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('checkURL');
// @href passes each anchor's href attribute to checkURL()
$links = $xp->query("//a[php:functionString('checkURL', @href)]");
foreach ($links as $link) {
    echo $link->getAttribute('href'), PHP_EOL;
}
In this way you extract only the links you want.
This is a slight guess, but if I got it wrong you can still see the way to do it.
foreach ($links as $link) {
    // Extract and show the "href" attribute.
    if (preg_match("/(?:http.*)maindomain\.com\/pages\/\d+\/.*/", $link->getAttribute('href'))) {
        echo $link->nodeValue;
        echo $link->getAttribute('href'), '<br>';
    }
}
You already use a parser, so you could go a step further and run an XPath query on the DOM. XPath offers functions like starts-with() as well, so this might work:
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a[starts-with(@href, 'maindomain.com')]");
Loop over them afterwards:
foreach ($links as $link) {
// do sth. with it here
// after all, it is a DOMElement
}
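The XPath and regex ideas from the answers above can also be combined without registering a PHP function. This is a rough sketch under a few assumptions: the helper name isPageLink and the sample markup are invented for illustration, and the wanted format is taken to be a path of the form /pages/<digits>/<text>.

```php
<?php
// Hypothetical helper: true when the URL path looks like /pages/<digits>/<slug>.
function isPageLink(string $href): bool
{
    $path = parse_url($href, PHP_URL_PATH);
    return is_string($path) && preg_match('~^/pages/\d+/[^/]+$~', $path) === 1;
}

// Illustrative input only.
$html = '<a href="http://www.example.com/pages/12345667/some-texts-available-here">keep</a>'
      . '<a href="http://www.example.com/about">skip</a>'
      . '<a href="/pages/42/relative-link">keep too</a>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$matches = [];
// XPath 1.0 narrows the candidates; the regex does the precise check.
foreach ($xpath->query("//a[contains(@href, '/pages/')]") as $link) {
    $href = $link->getAttribute('href');
    if (isPageLink($href)) {
        $matches[] = $href;
    }
}
print_r($matches);
```

Checking only the path means absolute and relative links are treated the same; add a host comparison if links to other domains must be excluded.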

Modify an HTML attribute with PHP

I have an HTML string that contains exactly one a element. Example:
test
In PHP I have to test whether rel contains external and, if so, modify href and save the string.
I have looked at DOM nodes and objects, but they seem like too much for a single a element: I have to iterate to get to the HTML nodes, and I am not sure how to test whether rel exists and contains external.
$html = new DOMDocument();
$html->loadHtml($txt);
$a = $html->getElementsByTagName('a');
$attr = $a->item(0)->attributes();
...
At this point I get a DOMNamedNodeMap, which seems like overhead. Is there any simpler way to do this, or should I do it with DOM?
Is there any simpler way to do this, or should I do it with DOM?
Do it with DOM.
Here's an example:
<?php
$html = 'test';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]");
foreach($nodes as $node) {
$node->setAttribute('href', 'http://example.org');
}
echo $dom->saveHTML();
I kept going with DOM. This is what I ended up with:
$html = new DOMDocument();
$html->loadHTML('<?xml encoding="utf-8" ?>' . $txt);
$nodes = $html->getElementsByTagName('a');
foreach ($nodes as $node) {
    foreach ($node->attributes as $att) {
        if ($att->name == 'rel') {
            // strpos() returns 0 for a match at the start, so compare with false
            if (strpos($att->value, 'external') !== false) {
                $node->setAttribute('href', 'modified_url_goes_here');
            }
        }
    }
}
$txt = $html->saveHTML();
I did not want to load any other library for just this one string.
The best way is to use an HTML parser/DOM, but here's a regex solution:
$html = 'test<br>
<p> Some text</p>
test2<br>
<a rel="external">test3</a> <-- This won\'t work since there is no href in it.
';
$new = preg_replace_callback('/<a.+?rel\s*=\s*"([^"]*)"[^>]*>/i', function($m){
if(strpos($m[1], 'external') !== false){
$m[0] = preg_replace('/href\s*=\s*(("[^"]*")|(\'[^\']*\'))/i', 'href="http://example.com"', $m[0]);
}
return $m[0];
}, $html);
echo $new;
You could use a regular expression. If the tag matches
/\s+rel\s*=\s*".*external.*"/
then do a regex replace like
/(<a.*href\s*=\s*")([^"]*)("[^>]*>)/\1[your new href here]\3/
That said, using a library that does this kind of thing for you is much easier (like jQuery for JavaScript).
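For the single-anchor case, the DOM route does not have to return a whole document. The sketch below is one possible way to do it; the anchor in $txt is an assumption, since the original markup from the question is not shown, and the new URL is a placeholder.

```php
<?php
// Assumed input: an illustrative anchor, not the asker's actual string.
$txt = '<a href="http://old.example.com/" rel="nofollow external">test</a>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($txt);

$a = $dom->getElementsByTagName('a')->item(0);
if ($a !== null) {
    // Treat rel as a list of space-separated tokens.
    $tokens = preg_split('/\s+/', trim($a->getAttribute('rel')));
    if (in_array('external', $tokens, true)) {
        $a->setAttribute('href', 'http://example.org');
    }
    // Passing the node to saveHTML() serializes just that element,
    // without the doctype/html/body wrapper that loadHTML() adds.
    $txt = $dom->saveHTML($a);
}
echo $txt, PHP_EOL;
```

Serializing only the node keeps the result a fragment, so it can be stored back where the original string came from.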

How can I match the following pattern?

I need to match the pattern
<a class="item-link" href="NEED TO GET THIS PART">AND THIS PART</a>
I tried the following three regex patterns, but none of them works.
preg_match_all("/<a.*(?:[^class=\"item-link\"=]*)class=\"item-link\"(?:[^href=]*)href=(?:'|\")?(.*)(?:'|\")(?:[^>]*)>(.*)<\/a>/", $content, $tablecontent);
preg_match_all("|/<a (?:[^href=]*)href=(?:'|\")?(.*)(?:'|\")(?:[^>]*)>(.*)<\/a>/|s", $content, $tablecontent);
preg_match_all("|/<a.+class=\"item-link\".+href=\"(.*)\"[^>]*>\.+<\/a[^>]*>/|m", $content, $tablecontent);
print_r($tablecontent);
Try this:
preg_match('/<a class="item-link" href="([^"]+)">([^<]+)<\/a>/', $content, $matches);
This is the proper way to do this:
$html = '<a class="item-link" href="NEED TO GET THIS PART">AND THIS PART</a>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
// @class selects the attribute; a bare "class" would look for a child element
$results = $xp->query('//a[@class="item-link"]');
foreach ($results as $link) {
    $href = $link->getAttribute('href');
    $text = $link->nodeValue;
    // ... do your stuff here ...
}
Overkill for a single link, but by far the easiest way when dealing with a full HTML page.
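On a full page, anchors often carry more than one class, and an exact comparison on the class attribute would skip those. A minimal sketch of a token-based match follows; the sample markup and the $found array are illustrative assumptions.

```php
<?php
// Illustrative input only.
$content = '<ul>'
         . '<li><a class="item-link" href="/first">First item</a></li>'
         . '<li><a href="/second" class="featured item-link">Second item</a></li>'
         . '<li><a class="other" href="/third">Third item</a></li>'
         . '</ul>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($content);
$xp = new DOMXPath($dom);

// Pad the class list with spaces so "item-link" matches only as a whole token.
$query = "//a[contains(concat(' ', normalize-space(@class), ' '), ' item-link ')]";

$found = [];
foreach ($xp->query($query) as $link) {
    $found[] = [
        'href' => $link->getAttribute('href'),
        'text' => trim($link->textContent),
    ];
}
print_r($found);
```

The attribute order inside the tag does not matter here, which is one of the things the regex attempts in the question struggle with.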

PHP regular expression for matching anchor tags

View the source of this page: www.songs.pk/indian/7days.html
There are only eight links that start with http://link1
For example: Tune Mera Naam Liya
I want a PHP regular expression that matches the URL
http://link1.songs.pk/song1.php?songid=2792
and the link text
Tune Mera Naam Liya
Thanks.
You're better off using something like simplehtmldom to find all links, then filtering for the ones with the relevant href and text.
Parsing HTML with regex isn't always the best solution, and in your case I feel it will bring you only pain.
$href = 'some_href';
$inner_text = 'some text';
$desired_anchors = array();
$html = file_get_html('your_file_or_url');
// Find all anchors, returns an array of element objects
foreach ($html->find('a') as $anchor) {
    // Compare with ==; a single = would assign instead of test
    if ($anchor->href == $href && $anchor->innertext == $inner_text) {
        $desired_anchors[] = $anchor;
    }
}
print_r($desired_anchors);
That should get you started.
Don't use a regex, buddy! PHP has a better-suited tool for this...
$dom = new DOMDocument;
$dom->loadHTML($str);
$matchedAnchors = array();
$anchors = $dom->getElementsByTagName('a');
$match = 'http://link1';
foreach ($anchors as $anchor) {
    if ($anchor->hasAttribute('href') && substr($anchor->getAttribute('href'), 0, strlen($match)) === $match) {
        $matchedAnchors[] = $anchor;
    }
}
Here you go:
preg_match_all('~<a .*href="(http://link1\..*)".*>(.*)</a>~Ui',$str,$match,PREG_SET_ORDER);
print_r($match);
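If both the URL and the link text are needed, the prefix test can also be pushed into an XPath query. This is only a sketch: the two anchors in $str are illustrative and the real page from the question is not reproduced here.

```php
<?php
// Illustrative markup, assumed for the example.
$str = '<a href="http://link1.songs.pk/song1.php?songid=2792">Tune Mera Naam Liya</a>'
     . '<a href="http://www.songs.pk/index.html">Home</a>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);

$songs = [];
// starts-with() is an XPath 1.0 string function, so no regex is needed for a prefix test.
foreach ($xpath->query("//a[starts-with(@href, 'http://link1')]") as $anchor) {
    // Map each matching URL to its link text.
    $songs[$anchor->getAttribute('href')] = trim($anchor->textContent);
}
print_r($songs);
```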

How should I get a div's content like this using DOM in PHP?

The div looks like this:
<div style="width:90%;margin:0 auto;color:#Black;" id="content">
this is text, several <br/> tags
</div>
How should I get the div's content, including the tags, using DOM in PHP?
Assuming you're using PHP 5, you can use DOMDocument. Note that it doesn't provide a simple means of retrieving the inner HTML of an element. You can do something along the following lines:
function DOMinnerHTML($element)
{
    $innerHTML = "";
    $children = $element->childNodes;
    foreach ($children as $child)
    {
        $tmp_dom = new DOMDocument();
        $tmp_dom->appendChild($tmp_dom->importNode($child, true));
        $innerHTML .= trim($tmp_dom->saveHTML());
    }
    return $innerHTML;
}
$dom = new DOMDocument();
$dom->loadHTML($html);
$items = $dom->getElementsByTagName('div');
if ($items->length)
{
    $innerHTML = DOMinnerHTML($items->item(0));
}
echo $innerHTML;
For something this simple, although I don't normally recommend it, I'd use regex:
preg_match('|<div[^>]+>(.*?)</div>|is', $html, $match);
if ($match)
{
    // preg_match() stores the first capture group as a string in $match[1]
    echo 'html is: ' . $match[1];
}
Something like this?
$document = new DOMDocument();
$document->loadHTML($html);
$element = $document->getElementById('content');
To get the values, you can try something like this
$doc = new DOMDocument();
$doc->loadHTMLFile('link-t0-html-file.php');
$xpath = new DOMXPath($doc);
$element = $xpath->query("//*[@id='content']")->item(0);
echo $element->nodeValue;
If I am not mistaken, you want this:
echo "<div style='width:90%;margin:0 auto;color:#000000;font-size:14px;line-height:24px;'
id='content'>";
echo "this is text, several <br/> tags";
echo "</div>";
Just remember never to put an unescaped double quote (") inside a double-quoted string; use single quotes (') inside double quotes instead.
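To read the div's content with its tags, rather than print it, another option is to serialize each child node with saveHTML(). This is a sketch under assumptions: the markup in $html is a shortened, illustrative version of the div from the question, and $inner is a name chosen for the example.

```php
<?php
// Illustrative input, assumed for the example.
$html = '<div style="width:90%;margin:0 auto;" id="content">'
      . 'this is text, several <br> tags and <b>bold</b>'
      . '</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$div = $xpath->query("//*[@id='content']")->item(0);
$inner = '';
if ($div !== null) {
    // saveHTML($node) serializes one node, so concatenating the children
    // yields the element's inner HTML without the div itself.
    foreach ($div->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}
echo $inner, PHP_EOL;
```

Unlike nodeValue, which returns text only, this keeps the child tags in the result.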
