Find and replace all links in a web page using PHP/JavaScript

I need to find the links in a part of some HTML code and replace all of them with one of two different absolute/base domains followed by the link's original path.
I have found a lot of ideas and tried a lot of different solutions, but luck isn't on my side on this one. Please help me out!
Thank you!
This is my code:
<?php
$url = "http://www.oxfordreference.com/views/SEARCH_RESULTS.html?&q=android";
$raw = file_get_contents($url);
$newlines = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

// Cut the results table out of the page
$start = strpos($content, '<table class="short_results_summary_table">');
$end = strpos($content, '</table>', $start) + 8;
$table = substr($content, $start, $end - $start);
echo $table;

$dom = new DOMDocument();
$dom->strictErrorChecking = false; // must be set before loading
$dom->loadHTML($table);

// Get all the links
$links = $dom->getElementsByTagName("a");
foreach ($links as $link) {
    $href = $link->getAttribute("href");
    echo $href;
    // strpos() takes the haystack first and returns false (not -1) on no match;
    // string concatenation in PHP is ".", not "+"
    if (strpos($href, "http://oxfordreference.com") === false) {
        if (strpos($href, "/views/") === false) {
            $ref = "http://oxfordreference.com/views/" . $href;
        } else {
            $ref = "http://oxfordreference.com" . $href;
        }
        $link->setAttribute("href", $ref);
        echo $link->getAttribute("href");
    }
}

$table12 = $dom->saveHTML(); // saveHTML is a method, not a property
preg_match_all("|<tr(.*)</tr>|U", $table12, $rows);
print_r($rows[0]); // $rows[0] is an array; echo would just print "Array"
foreach ($rows[0] as $row) {
    if (strpos($row, '<th') === false) {
        preg_match_all("|<td(.*)</td>|U", $row, $cells);
        print_r($cells);
    }
}
?>
When I run this code I get an htmlParseEntityRef: expecting ';' warning for the line where I load the HTML.
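That warning comes from libxml (which DOMDocument uses internally) hitting a bare & in the markup, such as the ones in that page's query-string URLs. A minimal sketch of the usual remedy is to collect the parser errors instead of letting them surface as PHP warnings:
// Collect libxml parse errors instead of emitting PHP warnings;
// bare "&" characters otherwise trigger htmlParseEntityRef.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($table); // $table as built in the question's code

foreach (libxml_get_errors() as $error) {
    // e.g. log $error->message somewhere instead of displaying it
}
libxml_clear_errors();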

var links = document.getElementsByTagName("a"); will get you all the links.
And this will loop through them:
for (var i = 0; i < links.length; i++) {
    links[i].href = "newURLHERE";
}
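If the rewrite has to happen server-side instead, here is a minimal PHP sketch of the same loop (assuming the page markup is already in $html):
// Server-side equivalent of the JavaScript loop above
$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress parse warnings for real-world markup
foreach ($dom->getElementsByTagName('a') as $link) {
    $link->setAttribute('href', 'newURLHERE');
}
echo $dom->saveHTML();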

You should use jQuery - it is excellent for link replacement. Rather than explaining it here, please look at this answer:
How to change the href for a hyperlink using jQuery

I recommend scrappedcola's answer, but if you don't want to do it on the client side you can use a regex replace:
ob_start();
//your HTML
//end of the page
$body = ob_get_clean();
// preg_replace returns the modified string (it does not change $body in place),
// and the replacement has to keep the rest of the <a> tag intact
$body = preg_replace('/(<a[^>]*href=)"[^"]*"/', '$1"NewURL"', $body);
echo $body;
You can use backreferences ($1) or the callback version (preg_replace_callback) to modify the output as you like.
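For instance, a minimal sketch of the callback variant; the redirect scheme here is purely hypothetical:
// Rewrite every href through a callback (the prefix below is just an example)
$body = preg_replace_callback(
    '/(<a[^>]*href=")([^"]*)(")/',
    function ($m) {
        return $m[1] . 'http://example.com/redirect?to=' . urlencode($m[2]) . $m[3];
    },
    $body
);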

Related

Change 'href' value of a link using PHP and DOM

I would like to change all links in an HTML variable to random ones. Here is my code, but something prevents the links from being changed:
<?php
// NOTE: the anchor markup was stripped when this page was rendered;
// the hrefs below are placeholders standing in for the original long URLs
$jobTemplateDetails = '<a href="http://example.com/long-url-1">Click!</a>
<a href="http://example.com/long-url-2">Click!</a>';

////////////////////// CHANGE ALL LINKS
$linkDom = new DOMDocument;
@$linkDom->loadHTML($jobTemplateDetails); // "@" suppresses parse warnings for fragment HTML
$allLinks = $linkDom->getElementsByTagName('a');
foreach ($allLinks as $rawLink) {
    $longLink = $rawLink->getAttribute('href');
    $str = 'abcdefghijklmnopqrstuvwxyz';
    $randomChar1 = $str[mt_rand(0, strlen($str) - 1)];
    $randomChar2 = $str[mt_rand(0, strlen($str) - 1)];
    $randomChar3 = $str[mt_rand(0, strlen($str) - 1)];
    $randomChar4 = $str[mt_rand(0, strlen($str) - 1)];
    $shortURL = mt_rand(1, 9).$randomChar1.mt_rand(1, 9).$randomChar2.$randomChar3.$randomChar4;
    $rawLink->setAttribute('href', $shortURL);
}
echo $jobTemplateDetails;
When you echo $jobTemplateDetails; you only output the original input string, not the DOMDocument you manipulated.
Change that to
echo $linkDom->saveHTML();
/// OUTPUT (the hrefs are now random, e.g.):
// <a href="3a7bcd">Click!</a>
// <a href="9f2xyz">Click!</a>
a fiddle: https://3v4l.org/KuCic
and the docs

PHP Web Crawler doesn't crawl .php files

This is the simple web crawler I was trying to build:
<?php
$to_crawl = "http://samplewebsite.com/about.php";

function get_links($url)
{
    $input = @file_get_contents($url);
    $regexp = " <a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a> ";
    preg_match_all("/$regexp/siU", $input, $matches);
    $l = $matches[2];
    foreach ($l as $link) {
        echo $link."</br>";
    }
}

get_links($to_crawl);
?>
When I try to run the script with the $to_crawl variable set to a URL ending with a file name, e.g. "facebook.com/about", it works, but for some reason it just echoes nothing when the link ends with a '.php' filename. Can someone please help?
To get all links and their inner texts, you can use DOMDocument like this:
$dom = new DOMDocument;
@$dom->loadHTML($input); // Your input (HTML code); "@" suppresses parse warnings
$xp = new DOMXPath($dom);
$links = $xp->query('//a[@href]'); // XPath to get only <a> tags with a href attribute
$result = array();
foreach ($links as $link) {
    $result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);
See IDEONE demo
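To tie this back to the question, a minimal sketch of get_links() rebuilt around the DOM approach (the URL is the one from the question):
function get_links($url)
{
    $input = @file_get_contents($url);
    if ($input === false) {
        return array();
    }
    $dom = new DOMDocument;
    @$dom->loadHTML($input);
    $xp = new DOMXPath($dom);
    $result = array();
    // Unlike the whitespace-sensitive regex, this finds <a> tags on any page
    foreach ($xp->query('//a[@href]') as $link) {
        $result[] = array($link->getAttribute("href"), $link->nodeValue);
    }
    return $result;
}

print_r(get_links("http://samplewebsite.com/about.php"));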

Loading content from remote site doesn't work, but why?

I'm still working on this catalogue for a client, which loads images from a remote site via PHP and the Simple DOM Parser.
// Code excerpt from http://internetvolk.de/fileadmin/template/res/scrape.php, this is just one case of a select
$subcat = $_GET['subcat'];
$url = "http://pinesite.com/meubelen/index.php?".$subcat."&lang=de";
$html = file_get_html(html_entity_decode($url));
$iframe = $html->find('iframe', 0);
$url2 = $iframe->src;
$html->clear();
unset($html);

$fullurl = "http://pinesite.com/meubelen/".$url2;
$html2 = file_get_html(html_entity_decode($fullurl));
$pagecount = 1;
$titles = $html2->find('.tekst');
$images = $html2->find('.plaatje');
$output = array(); // must be an array, not '', since items are appended below
$i = 0;
foreach ($images as $image) {
    $item['title'] = $titles[$i]->find('p', 0)->plaintext;
    $imagePath = $image->find('img', 0)->src;
    $item['thumb'] = resize("http://pinesite.com".str_replace('thumb_', '', $imagePath), array("w" => 225, "h" => 162));
    $item['image'] = 'http://pinesite.com'.str_replace('thumb_', '', $imagePath);
    $fullurl2 = "http://pinesite.com/meubelen/prog/showpic.php?src=".str_replace('thumb_', '', $imagePath)."&taal=de";
    $html3 = file_get_html($fullurl2);
    $item['size'] = str_replace(' ', '', $html3->find('td', 1)->plaintext);
    unset($html3);
    $output[] = $item;
    $i++;
}

if (count($html2->find('center')) > 1) {
    // ok, multi-page here, let's find out how many there are
    $pagecount = count($html2->find('center', 0)->find('a')) - 1;
    for ($i = 1; $i < $pagecount; $i++) {
        $startID = $i * 20;
        $newurl = html_entity_decode($fullurl."&beginrec=".$startID);
        $html3 = file_get_html($newurl);
        $titles = $html3->find('.tekst');
        $images = $html3->find('.plaatje');
        $a = 0;
        foreach ($images as $image) {
            $item['title'] = $titles[$a]->find('p', 0)->plaintext;
            $item['image'] = 'http://pinesite.com'.str_replace('thumb_', '', $image->find('img', 0)->src);
            $item['thumb'] = resize($item['image'], array("w" => 225, "h" => 150));
            $output[] = $item;
            $a++;
        }
        $html3->clear();
        unset($html3);
    }
}

echo json_encode($output);
So what it should do (and does with some categories): output the images, the titles and the thumbnails from this page: http://pinesite.com
This works, for example, if you pass it "?function=images&subcat=antiek", but not if you pass it "?function=images&subcat=stoelen". I don't even think it's a problem with the remote page, so there has to be an error in my code.
Ehm... trying to state the obvious maybe, but 'stoele'?
As it turns out, my code was completely fine; it was a missing space in the HTML of the remote site that kept the Simple PHP DOM Parser from recognizing the iframe I was looking for. I fixed it on my end by running a str_replace on the code first to repair the faulty markup, roughly as sketched below.
I know it's a dirty solution, but it works :)
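For reference, a hypothetical reconstruction of that patch (the exact broken markup on the remote site was not preserved here, so the tag below is a placeholder):
// Fetch the raw HTML, patch the malformed tag so Simple HTML DOM can
// find the iframe, then parse the repaired string
$raw = file_get_contents(html_entity_decode($url));
$raw = str_replace('<iframesrc=', '<iframe src=', $raw); // placeholder for the actual faulty markup
$html = str_get_html($raw); // Simple HTML DOM's string-based loader
$iframe = $html->find('iframe', 0);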

Extracting certain portions of HTML from within PHP

Ok, so I'm writing an application in PHP to check whether all the links on my sites are valid, so I can update them if I have to.
And I ran into a problem. I've tried to use SimpleXML and DOMDocument objects to extract the tags, but when I run the app against a sample site I usually get a ton of errors if I use the SimpleXML object type.
So is there a way to scan the HTML document for href attributes that's pretty much as simple as using SimpleXML?
<?php
// what I want to do is get a similar effect to the code described below:
foreach ($html->html->body->a as $link)
{
    // store the $link into a file
    foreach ($link->attributes() as $attribute => $value)
    {
        // procedure to place the href value into a file
    }
}
?>
So basically I'm looking for a way to perform the above operation. The thing is, I'm currently getting confused as to how I should treat the string I'm getting with the HTML code in it...
Just to be clear, I'm using the following primitive way of getting the HTML file:
<?php
$target = "http://www.targeturl.com";
$file_handle = fopen($target, "r");
$a = "";
while (!feof($file_handle)) {
    $a .= fgets($file_handle, 4096);
}
fclose($file_handle);
?>
Any info would be useful, as well as any other-language alternatives where the above problem is more elegantly solved (Python, C or C++).
You can use DOMDocument::loadHTML
Here's a bunch of code we use for an HTML parsing tool we wrote.
$target = "http://www.targeturl.com";
$result = file_get_contents($target);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
#$dom->loadHTML($result);
$links = extractLink(getTags( $dom, 'a', ));
function extractLink( $html, $argument = 1 ) {
$href_regex_pattern = '/<a[^>]*?href=[\'"](.*?)[\'"][^>]*?>(.*?)<\/a>/si';
preg_match_all($href_regex_pattern,$html,$matches);
if (count($matches)) {
if (is_array($matches[$argument]) && count($matches[$argument])) {
return $matches[$argument][0];
}
return $matches[1];
} else
function getTags( $dom, $tagName, $element = false, $children = false ) {
$html = '';
$domxpath = new DOMXPath($dom);
$children = ($children) ? "/".$children : '';
$filtered = $domxpath->query("//$tagName" . $children);
$i = 0;
while( $myItem = $filtered->item($i++) ){
$newDom = new DOMDocument;
$newDom->formatOutput = true;
$node = $newDom->importNode( $myItem, true );
$newDom->appendChild($node);
$html[] = $newDom->saveHTML();
}
if ($element !== false && isset($html[$element])) {
return $html[$element];
} else
return $html;
}
You could just use strpos($html, 'href=') and then parse out the URL, as sketched below. You could also search for <a or .php.
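A minimal sketch of that strpos-based approach (assuming double-quoted href attributes, which real-world HTML does not guarantee):
// Walk the document, jumping from one href=" occurrence to the next
$urls = array();
$offset = 0;
while (($pos = strpos($html, 'href="', $offset)) !== false) {
    $start = $pos + strlen('href="');
    $end = strpos($html, '"', $start); // closing quote of the attribute value
    if ($end === false) {
        break;
    }
    $urls[] = substr($html, $start, $end - $start);
    $offset = $end + 1;
}
print_r($urls);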

PHP: How to insert a string into matched regex pattern (adding rel="no-follow" to anchor links)

I am writing a commenting system for my website, using PHP.
I want to do the following:
Detect all external links (i.e. anchor tags whose href does NOT contain the string mywebsite.com) in a comment
Add the attribute rel="no-follow" to the anchor tags identified in step 1 above.
I have an idea for such a function, but I will need some help from more experienced PHP developers so that I'm sure I'm doing things the right way. This is what my first attempt looks like:
<?php
function process_comment($comment)
{
    $external_url_pattern = "href=[^mywebsite.com]"; // this regex is probably wrong (Help!)
    // are there any matches?
    $matches = array();
    preg_match_all($external_url_pattern, $comment, $matches);
    foreach ($matches as $match)
    {
        // how do we insert the 'rel="no-follow"' string?
    }
}
?>
Would appreciate any comments, pointers and tips in helping me complete this function. Thanks.
Don't know if this will be appropriate, but instead of regex you could do it with DOMDocument as well:
$dom = new DOMDocument();
$dom->loadHTML($html);

// Evaluate anchor tags in the HTML
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    // external link: the URL does not mention mywebsite.com
    if (strpos($url, "mywebsite.com") === false) {
        $href->setAttribute("rel", "no-follow");
    }
}

// save html
$html = $dom->saveHTML();
echo $html;
Hope it helps
This is a bit tricky but will do the job.
function process_comment($str, $mysite = 'mywebsite\.com') // your domain, regex-escaped
{
    // parses href attribute values into $match (non-greedy, so multiple
    // links on one line don't get swallowed into a single match)
    if (preg_match_all('/href\="(.*?)"/', $str, $match))
    {
        foreach ($match[1] as $v)
        {
            // check whether the matched value contains your site as host name;
            // if not, add rel="no-follow" by replacing the link with link + attribute
            if (!preg_match('#^(?:http://)?(\w+\.)?'.$mysite.'(.*)?#i', $v, $m))
            {
                $rel = $v.'" rel="no-follow';
                $str = str_replace($v, $rel, $str);
            }
        }
    }
    return $str;
}
process_comment($comment);
You can simply use strstr instead of the second preg_match. I used the regex because I think some URLs may contain something like this: "http://www.external.com/url.php?v=www.mysite.com"
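A sketch of that strstr variant, dropped into the loop above (with the caveat just mentioned, since strstr matches the domain anywhere in the URL):
// strstr returns false when the needle is absent, so anything that
// mentions the domain anywhere in the URL is treated as internal
if (strstr($v, 'mywebsite.com') === false) {
    $rel = $v.'" rel="no-follow';
    $str = str_replace($v, $rel, $str);
}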
