I want to remove the URLs of certain sites from a string. I used this:
<?php
$URLContent = '<p><a href="https://www.google.com">Google</a></p><p><a href="https://www.anothersite.com">AnotherSite</a></p>';
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$LinksToCheck = in_array('google.com' , $LinksToRemove);
if (strpos($URLContent, $LinksToCheck) !== 0) {
$URLContent = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $URLContent);
}
echo $URLContent;
?>
In this example, I want to remove the links to google.com, yahoo.com and msn.com only if any of them is found in the string $URLContent, but keep any other links.
The result of the previous code is:
<p>Google</p><p>AnotherSite</p>
but I want it to be:
<p>Google</p><p><a href="https://www.anothersite.com">AnotherSite</a></p>
One solution would be to explode your $URLContent and compare each piece against the values in $LinksToRemove.
It could look like this:
<?php
$URLContent = '<p><a href="https://www.google.com">Google</a></p><p><a href="https://www.anothersite.com">AnotherSite</a></p>';
$urlList = explode('</p>', $URLContent);
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$urlFormat = [];
foreach ($urlList as $url) {
foreach ($LinksToRemove as $link) {
if (str_contains($url, $link)) {
// Keep just the site name; the closing </p> comes back from the implode() below.
$url = '<p>' . ucfirst(str_replace('.com', '', $link));
break;
}
}
$urlFormat[] = $url;
}
// Re-join on the same delimiter we exploded on, so the </p> tags are restored.
$result = implode('</p>', $urlFormat);
echo $result;
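If the markup ever gets more complex than simple <p> blocks, a DOM-based sketch of the same idea may be safer; the sample string and the stripos() matching below are assumptions, not the asker's exact data. It unwraps only the anchors whose href contains one of the listed hosts and leaves every other link untouched.
<?php
$URLContent    = '<p><a href="https://www.google.com">Google</a></p><p><a href="https://www.anothersite.com">AnotherSite</a></p>';
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');

$dom = new DOMDocument();
// libxml complains about fragments, so silence the warnings.
@$dom->loadHTML($URLContent);

// Copy the live node list first, because replacing nodes while iterating it skips items.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
    foreach ($LinksToRemove as $host) {
        if (stripos($a->getAttribute('href'), $host) !== false) {
            // Unwrap: swap the <a> element for its plain text content.
            $a->parentNode->replaceChild($dom->createTextNode($a->textContent), $a);
            break;
        }
    }
}

// Note: saveHTML() wraps the fragment in <html><body>, which is fine for a sketch.
echo $dom->saveHTML();
?>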
I want to allow my website visitors (any Tom, Dick & Harry) to submit their links to my webpage for output on my page.
I need to parse user-submitted URLs before echoing them on my page, since I won't know what URLs they will be submitting or how those URLs are structured.
A user could theoretically visit my page and inject some Javascript code using, for example:
?search=<script>alert('hacked')</script>
You understand my point.
I have to write a PHP script that, when users submit their URLs, parses them and encodes them by applying urlencode(), rawurlencode() and intval() in the appropriate places before outputting them via htmlspecialchars().
Someone else wrote the following script. The problem is that it outputs like so:
http%3A%2F%2Fexample.com%2Fcat%2Fsubcat?var_1=value+1&var2=2&this_other=thing&number_is=13
It should output like this:
http://example.com/cat/subcat?var_1=value+1&var2=2&this_other=thing&number_is=13
This is their code:
Third Party Code:
<?php
function encodedUrl($url){
$query_strings_array = [];
$query_string_parts = [];
// parse URL & get query
$scheme = parse_url($url, PHP_URL_SCHEME);
$host = parse_url($url, PHP_URL_HOST);
$path = parse_url($url, PHP_URL_PATH);
$query_strings = parse_url($url, PHP_URL_QUERY);
// parse query into array
parse_str($query_strings, $query_strings_array);
// separate keys & values
$query_strings_keys = array_keys($query_strings_array);
$query_strings_values = array_values($query_strings_array);
// loop query
for($i = 0; $i < count($query_strings_array); $i++){
$k = urlencode($query_strings_keys[$i]);
$v = $query_strings_values[$i];
$val = is_numeric($v) ? intval($v) : urlencode($v);
$query_string_parts[] = "{$k}={$val}";
}
// re-assemble URL
$encodedHostPath = rawurlencode("{$scheme}://{$host}{$path}");
return $encodedHostPath . '?' . implode('&', $query_string_parts);
}
$url1 = 'http://example.com/cat/subcat?var 1=value 1&var2=2&this other=thing&number is=13';
$url2 = 'http://example.com/autos/cars/list.php?state=california&max_price=50000';
// run urls thru function & echo
echo $encoded_url1 = encodedUrl($url1); echo '<br>';
echo $encoded_url2 = encodedUrl($url2); echo '<br>';
?>
So I changed this line of theirs:
$encodedHostPath = rawurlencode("{$scheme}://{$host}{$path}");
to this line of mine (my amendment):
$encodedHostPath = rawurlencode("{$scheme}").'://'.rawurlencode("{$host}").$path;
And it seems to be working, as it outputs:
http://example.com/cat/subcat?var_1=value+1&var2=2&this_other=thing&number_is=13
QUESTION 1:
But I am not sure whether I put the rawurlencode() calls in the right places, so it's best you check.
Also, shouldn't the $path be inside rawurlencode(), like so?
rawurlencode($path)
Note, however, that
rawurlencode($path)
doesn't output right.
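On that point: rawurlencode() applied to the whole path also encodes the / separators, which is exactly what produced the %2F sequences in the broken output. A common workaround, sketched here with a made-up helper name, is to encode each path segment on its own and re-join them with /:
<?php
// Hypothetical helper: percent-encode each path segment but keep the slashes.
function encodePathSegments($path) {
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}
echo encodePathSegments('/cat/sub cat/'); // /cat/sub%20cat/
?>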
QUESTION 2:
I further updated their code to a new version and it's not outputting right. Why is that? Where am I going wrong?
All I did was add a few lines of my own at the bottom of their code.
This is my update (new version), which outputs wrongly, like this:
http%3A%2F%2Fexample.com%2Fcat%2Fsubcat?var_1=value+1&var2=2&this_other=thing&number_is=13
MY UPDATE (NEW VERSION):
<?php
function encodedUrledited($url){
$query_strings_array = [];
$query_string_parts = [];
// parse URL & get query
$scheme = parse_url($url, PHP_URL_SCHEME);
$host = parse_url($url, PHP_URL_HOST);
$path = parse_url($url, PHP_URL_PATH);
$query_strings = parse_url($url, PHP_URL_QUERY);
// parse query into array
parse_str($query_strings, $query_strings_array);
// separate keys & values
$query_strings_keys = array_keys($query_strings_array);
$query_strings_values = array_values($query_strings_array);
// loop query
for($i = 0; $i < count($query_strings_array); $i++){
$k = urlencode($query_strings_keys[$i]);
$v = $query_strings_values[$i];
$val = is_numeric($v) ? intval($v) : urlencode($v);
$query_string_parts[] = "{$k}={$val}";
}
// re-assemble URL
$encodedHostPath = rawurlencode("{$scheme}").'://'.rawurlencode("{$host}").$path;
return $encodedHostPath . '?' .implode('&', $query_string_parts);
}
if(!ISSET($_POST['url1']) && empty($_POST['url1']) && !ISSET($_POST['url2']) && empty($_POST['url2']))
{
//Default Values for Substituting empty User Inputs.
$url1 = 'http://example.com/cat/subcat?var 1=value 1&var2=2&this other=thing&number is=138';
$url2 = 'http://example.com/autos/cars/list.php?state=california&max_price=500008';
}
else
{
//User has made following inputs...
$url1 = $_POST['url1'];
$url2 = $_POST['url2'];
//Encode User's Url inputs. (Add rawurlencode(), urlencode() and intval() in user's submitted url where appropriate).
$encoded_url1 = encodedUrledited($url1);
$encoded_url2 = encodedUrledited($url2);
}
echo $link1 = '<a href=' .htmlspecialchars($encoded_url1) .'>' .htmlspecialchars($encoded_url1) .'</a>';
echo '<br/>';
echo $link2 = '<a href=' .htmlspecialchars($encoded_url2) .'>' .htmlspecialchars($encoded_url2) . '</a>';
echo '<br>';
?>
This thread is really about the second code block, my update.
Thank you!
I fixed my code.
Answering my own question.
Fixed Code:
<?php
function encodedUrledited($url){
$query_strings_array = [];
$query_string_parts = [];
// parse URL & get query
$scheme = parse_url($url, PHP_URL_SCHEME);
$host = parse_url($url, PHP_URL_HOST);
$path = parse_url($url, PHP_URL_PATH);
$query_strings = parse_url($url, PHP_URL_QUERY);
// parse query into array
parse_str($query_strings, $query_strings_array);
// separate keys & values
$query_strings_keys = array_keys($query_strings_array);
$query_strings_values = array_values($query_strings_array);
// loop query
for($i = 0; $i < count($query_strings_array); $i++){
$k = $query_strings_keys[$i];
$key = is_numeric($k) ? intval($k) : urlencode($k);
$v = $query_strings_values[$i];
$val = is_numeric($v) ? intval($v) : urlencode($v);
$query_string_parts[] = "{$key}={$val}";
}
// re-assemble URL
$encodedHostPath = rawurlencode($scheme).'://'.rawurlencode($host).$path;
$encodedHostPath .= '?' .implode('&', $query_string_parts);
return $encodedHostPath;
}
if(!ISSET($_POST['url1']) && empty($_POST['url1']) && !ISSET($_POST['url2']) && empty($_POST['url2']))
{
//Default Values for Substituting empty User Inputs.
$url1 = 'http://example.com/cat/subcat?var 1=value 1&var2=2&this other=thing&number is=138';
$url2 = 'http://example.com/autos/cars/list.php?state=california&max_price=500008';
}
else
{
//User has made following inputs...
$url1 = $_POST['url1'];
$url2 = $_POST['url2'];
//Encode User's Url inputs. (Add rawurlencode(), urlencode() and intval() in user's submitted url where appropriate).
}
$encoded_url1 = encodedUrledited($url1);
$encoded_url2 = encodedUrledited($url2);
$link1 = '<a href=' .htmlspecialchars($encoded_url1) .'>' .htmlspecialchars($encoded_url1) .'</a>';
$link2 = '<a href=' .htmlspecialchars($encoded_url2) .'>' .htmlspecialchars($encoded_url2) . '</a>';
echo $link1; echo '<br/>';
echo $link2; echo '<br/>';
?>
These two lines were supposed to be outside the ELSE. They weren't, hence all the issues. I moved them outside the ELSE and now the script works fine.
$encoded_url1 = encodedUrledited($url1);
$encoded_url2 = encodedUrledited($url2);
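As a side note, the same rebuild can be written more compactly with parse_url() plus http_build_query(), which percent-encodes the keys and values for you. This is only a sketch of the idea (the function name encodeUrlSketch is made up, and PHP 7+ is assumed for the ?? operator), not a drop-in replacement for the code above:
<?php
// Hypothetical alternative: let http_build_query() handle the query encoding.
function encodeUrlSketch($url) {
    $parts = parse_url($url);
    $query = [];
    parse_str($parts['query'] ?? '', $query);

    $rebuilt = ($parts['scheme'] ?? 'http') . '://'
             . rawurlencode($parts['host'] ?? '')
             . ($parts['path'] ?? '');

    if ($query) {
        // PHP_QUERY_RFC1738 encodes spaces as "+", matching the output shown above.
        $rebuilt .= '?' . http_build_query($query, '', '&', PHP_QUERY_RFC1738);
    }
    return $rebuilt;
}

echo encodeUrlSketch('http://example.com/cat/subcat?var 1=value 1&var2=2');
// http://example.com/cat/subcat?var_1=value+1&var2=2
?>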
I am currently using the following code to derive page, section and class variables from the URL.
$domain = 'http://' . $_SERVER['HTTP_HOST'];
$path = $_SERVER['REQUEST_URI'];
$url = $domain . $path;
// page + section + class
$page = basename($url);
$page = $class = str_replace('.php','',$page);
$page = str_replace('-',' ',$page);
if ($path == "/") {
$section = $class = "home";
} else if (basename(dirname($url),"/") == $_SERVER['HTTP_HOST']) {
$section = $page;
} else {
$section = basename(dirname($url),"/");
$section = str_replace('-',' ',$section);
$class = basename(dirname($url),"/") . " " . $class;
}
For example if the url is http://www.mydomain.co.uk/about/ the code will return the following variables:
$page = "about"
$section = "about"
$class = "about"
For http://www.mydomain.co.uk/about/general-info/
$page = "general info"
$section = "about"
$class = "about general-info"
But when I add more depth, for example for http://www.mydomain.co.uk/about/general-info/history/, the code produces:
$page = "history"
$section = "general info"
$class = "general-info history"
where ideally I need it to output the following:
$page = "history"
$section = "about general info"
$class = "about general-info history"
or break down the sections into as many as needed, for example:
$section1 = "about"
$section2 = "general-info"
Hopefully someone can help. If anything is unclear please ask.
What about using more general splitting?
// Url: http://www.mydomain.co.uk/about/general-info/history/
$slices = explode('/', $_SERVER['REQUEST_URI']);
// $slices == ['', 'about', 'general-info', 'history', ''] (explode() keeps the empty edge entries)
Then do your routing as you want:
$class = implode(' ', $slices);
$section = $slices[1];
// etc.
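If you prefer to work without those empty edge entries, filtering them out first is a small extra step (sketched here as an assumption about the routing you want, PHP 7+ for ??):
// Drop the empty strings produced by the leading/trailing slashes.
$parts = array_values(array_filter(explode('/', $_SERVER['REQUEST_URI']), 'strlen'));
// For /about/general-info/history/ this gives ['about', 'general-info', 'history'].
$class   = implode(' ', $parts);          // "about general-info history"
$section = $parts[0] ?? 'home';           // "about"
$page    = $parts ? end($parts) : 'home'; // "history"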
So you want a URI scheme as follows?
http://<domain>/<section-1>/<section-2>/.../<page>/
Special cases:
(1) http://<domain>/ -- page is empty, section is "home"
(2) http://<domain>/<section>/ -- page and section are '<section>'
Instead of relying on basename(), you should split your string into an array using explode() or preg_split(). You can use REQUEST_URI directly since the domain name does not give any extra information for your sections and page.
Once you have an array, you can easily count() the number of path components, handle special cases for empty and size-one paths, and so on. In the following example, I use array_pop() to extract the last part of the path to separate the page from the sections. Since you seem to desire space-separated strings for sections and page, I use implode() to join the arrays back into a string.
// No need for the domain stuff!
$path = $_SERVER['REQUEST_URI'];
// Split at '/', could use explode() but the PREG_SPLIT_NO_EMPTY flag is
// very handy since it handles "//" and "/" at start/end.
$tokens = preg_split('#/#', $path, -1, PREG_SPLIT_NO_EMPTY);
if (count($tokens) == 0) {
// Special case 1
$page = "";
$section = $class = 'home';
} elseif (count($tokens) == 1) {
// Special case 2
$page = $section = $class = $tokens[0];
} else {
// Class contains all tokens.
$class = implode(' ', $tokens);
// The last part is the page.
$page = array_pop($tokens);
// Everything else is sections.
$section = implode(' ', $tokens);
}
// You seem to want spaces for dashes in the section:
$section = str_replace('-', ' ', $section);
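Tracing that code for the deeper example URI gives exactly the values asked for above:
// REQUEST_URI = '/about/general-info/history/'
// $page    == 'history'
// $section == 'about general info'
// $class   == 'about general-info history'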
The function below is designed to apply rel="nofollow" attributes to all external links, and to no internal links unless the path matches a predefined root URL, defined as $my_folder below.
So given the variables...
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
And the content...
<a href="http://localhost/mytest/sample-page">internal</a>
<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
<a href="http://cnn.com">external</a>
The end result, after replacement should be...
<a href="http://localhost/mytest/sample-page">internal</a>
<a href="http://localhost/mytest/go/hostgator" rel="nofollow">internal cloaked link</a>
<a href="http://cnn.com" rel="nofollow">external</a>
Notice that the first link is not altered, since it's an internal link.
The link on the second line is also an internal link, but since it matches our $my_folder string, it gets the nofollow too.
The third link is the easiest: since it does not match $blog_url, it's obviously an external link.
However, in the script below, ALL of my links are getting nofollow. How can I fix the script to do what I want?
function save_rseo_nofollow($content) {
$my_folder = $rseo['nofollow_folder'];
$blog_url = get_bloginfo('url');
preg_match_all('~<a.*>~isU',$content["post_content"],$matches);
for ( $i = 0; $i <= sizeof($matches[0]); $i++){
if ( !preg_match( '~nofollow~is',$matches[0][$i])
&& (preg_match('~' . $my_folder . '~', $matches[0][$i])
|| !preg_match( '~'.$blog_url.'~',$matches[0][$i]))){
$result = trim($matches[0][$i],">");
$result .= ' rel="nofollow">';
$content["post_content"] = str_replace($matches[0][$i], $result, $content["post_content"]);
}
}
return $content;
}
Here is the DOMDocument solution...
$str = 'internal
internal cloaked link
external
external
external
external
';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML($str);
$a = $dom->getElementsByTagName('a');
$host = strtok($_SERVER['HTTP_HOST'], ':');
foreach($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
continue;
}
$noFollowRel = 'nofollow';
$oldRelAtt = $anchor->attributes->getNamedItem('rel');
if ($oldRelAtt == NULL) {
$newRel = $noFollowRel;
} else {
$oldRel = $oldRelAtt->nodeValue;
$oldRel = explode(' ', $oldRel);
if (in_array($noFollowRel, $oldRel)) {
continue;
}
$oldRel[] = $noFollowRel;
$newRel = implode(' ', $oldRel);
}
$newRelAtt = $dom->createAttribute('rel');
$noFollowNode = $dom->createTextNode($newRel);
$newRelAtt->appendChild($noFollowNode);
$anchor->appendChild($newRelAtt);
}
var_dump($dom->saveHTML());
Output
string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
internal
internal cloaked link
external
external
external
external
</body></html>
"
Try to make it more readable first, and only afterwards make your if rules more complex:
function save_rseo_nofollow($content) {
$content["post_content"] =
preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
return $content;
}
function cb2($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/hostgator"; // re-add quirky config here
$blog_url = "http://localhost/";
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
Gives the following output:
[post_content] =>
<a href="http://localhost/mytest/sample-page">internal</a>
<a href="http://localhost/mytest/go/hostgator" rel=nofollow>internal cloaked link</a>
<a href="http://cnn.com" rel=nofollow>external</a>
The problem in your original code might have been $rseo, which wasn't declared anywhere.
Try this one (PHP 5.3+). It can skip a selected address and allows a manually set rel parameter. The code:
function nofollow($html, $skip = null) {
return preg_replace_callback(
"#(<a[^>]+?)>#is", function ($mach) use ($skip) {
return (
!($skip && strpos($mach[1], $skip) !== false) &&
strpos($mach[1], 'rel=') === false
) ? $mach[1] . ' rel="nofollow">' : $mach[0];
},
$html
);
}
Examples:
echo nofollow('<a href="http://www.example.com" rel="external">something</a>');
// stays the same because it already contains a rel parameter
echo nofollow('<a href="http://www.example.com">something</a>');
// adds the rel="nofollow" parameter to the anchor
echo nofollow('<a href="http://localhost/page">something</a>', 'localhost');
// skips this link as an internal link
Using regular expressions to do this job properly would be quite complicated. It would be easier to use an actual parser, such as the one from the DOM extension. DOM isn't very beginner-friendly, so what you can do is load the HTML with DOM then run the modifications with SimpleXML. They're backed by the same library, so it's easy to use one with the other.
Here's how it can look:
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
$html = '<html><body>
<a href="http://localhost/mytest/sample-page">internal</a>
<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
<a href="http://cnn.com">external</a>
</body></html>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$sxe = simplexml_import_dom($dom);
// grab all <a> nodes with an href attribute
foreach ($sxe->xpath('//a[@href]') as $a)
{
if (substr($a['href'], 0, strlen($blog_url)) === $blog_url
&& substr($a['href'], 0, strlen($my_folder)) !== $my_folder)
{
// skip all links that start with the URL in $blog_url, as long as they
// don't start with the URL from $my_folder;
continue;
}
if (empty($a['rel']))
{
$a['rel'] = 'nofollow';
}
else
{
$a['rel'] .= ' nofollow';
}
}
$new_html = $dom->saveHTML();
echo $new_html;
As you can see, it's really short and simple. Depending on your needs, you may want to use preg_match() in place of the strpos() stuff, for example:
// change the regexp to your own rules, here we match everything under
// "http://localhost/mytest/" as long as it's not followed by "go"
if (preg_match('#^http://localhost/mytest/(?!go)#', $a['href']))
{
continue;
}
Note
I missed the last code block in the OP when I first read the question. The code I posted (and basically any solution based on DOM) is better suited to processing a whole page rather than an HTML block. Otherwise, DOM will attempt to "fix" your HTML and may add a <body> tag, a DOCTYPE, etc.
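If you do need to process a bare block rather than a full page, one way to keep DOM from adding the DOCTYPE and wrapper tags (assuming PHP 5.4+ with libxml 2.7.8+, and shown only as a sketch on top of the code above) is to pass two libxml flags when loading:
$dom = new DOMDocument;
// No implied <html><body> wrapper and no default DOCTYPE around the fragment.
@$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// ... modify the links as shown above ...
echo $dom->saveHTML();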
Thanks @alex for your nice solution. But I was having a problem with Japanese text, which I have fixed in the following way. Also, this code can skip multiple domains via the $whiteList array.
public function addRelNoFollow($html, $whiteList = [])
{
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$a = $dom->getElementsByTagName('a');
/** @var \DOMElement $anchor */
foreach ($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
$domain = parse_url($href, PHP_URL_HOST);
// Skip whiteList domains
if (in_array($domain, $whiteList, true)) {
continue;
}
// Check & get existing rel attribute values
$noFollow = 'nofollow';
$rel = $anchor->attributes->getNamedItem('rel');
if ($rel) {
$values = explode(' ', $rel->nodeValue);
if (in_array($noFollow, $values, true)) {
continue;
}
$values[] = $noFollow;
$newValue = implode(' ', $values);
} else {
$newValue = $noFollow;
}
// Create new rel attribute
$rel = $dom->createAttribute('rel');
$node = $dom->createTextNode($newValue);
$rel->appendChild($node);
$anchor->appendChild($rel);
}
// There is a problem with saveHTML() and saveXML(), both of them do not work correctly in Unix.
// They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.
// So we need to do as follows. #see https://stackoverflow.com/a/20675396/1710782
return $dom->saveHTML($dom->documentElement);
}
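A hypothetical call (the $helper object and the URLs are placeholders, since the class this method belongs to isn't shown):
// $helper is whatever object declares addRelNoFollow().
$html = '<p>日本語のテキスト <a href="https://example.com">example</a> <a href="https://partner.example">partner</a></p>';
echo $helper->addRelNoFollow($html, ['partner.example']);
// The example.com link gains rel="nofollow"; partner.example is whitelisted and left alone.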
<?php
$str = '<a href="http://localhost/mytest/sample-page">internal</a>
<a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
<a href="http://cnn.com">external</a>';
function test($x){
if (preg_match('#localhost/mytest/(?!go/)#i',$x[0])>0) return $x[0];
return 'rel="nofollow" '.$x[0];
}
echo preg_replace_callback('/href=[\'"][^\'"]+/i', 'test', $str);
?>
Here is another solution, which has a whitelist option and adds a target="_blank" attribute.
It also checks whether there is already a rel attribute before adding a new one.
function Add_Nofollow_Attr($Content, $Whitelist = [], $Add_Target_Blank = true)
{
$Whitelist[] = $_SERVER['HTTP_HOST'];
foreach ($Whitelist as $Key => $Link)
{
$Host = preg_replace('#^https?://#', '', $Link);
$Host = "https?://". preg_quote($Host, '/');
$Whitelist[$Key] = $Host;
}
if(preg_match_all("/<a .*?>/", $Content, $matches, PREG_SET_ORDER))
{
foreach ($matches as $Anchor_Tag)
{
$IS_Rel_Exist = $IS_Follow_Exist = $IS_Target_Blank_Exist = $Is_Valid_Tag = false;
if(preg_match_all("/(\w+)\s*=\s*['|\"](.*?)['|\"]/",$Anchor_Tag[0],$All_matches2))
{
foreach ($All_matches2[1] as $Key => $Attr_Name)
{
if($Attr_Name == 'href')
{
$Is_Valid_Tag = true;
$Url = $All_matches2[2][$Key];
// bypass #.. or internal links like "/"
if(preg_match('/^\s*[#|\/].*/', $Url))
{
continue 2;
}
foreach ($Whitelist as $Link)
{
if (preg_match("#$Link#", $Url)) {
continue 3;
}
}
}
else if($Attr_Name == 'rel')
{
$IS_Rel_Exist = true;
$Rel = $All_matches2[2][$Key];
preg_match("/[n|d]ofollow/", $Rel, $match, PREG_OFFSET_CAPTURE);
if( count($match) > 0 )
{
$IS_Follow_Exist = true;
}
else
{
$New_Rel = 'rel="'. $Rel . ' nofollow"';
}
}
else if($Attr_Name == 'target')
{
$IS_Target_Blank_Exist = true;
}
}
}
$New_Anchor_Tag = $Anchor_Tag;
if(!$IS_Rel_Exist)
{
$New_Anchor_Tag = str_replace(">",' rel="nofollow">',$Anchor_Tag);
}
else if(!$IS_Follow_Exist)
{
$New_Anchor_Tag = preg_replace("/rel=[\"|'].*?[\"|']/",$New_Rel,$Anchor_Tag);
}
if($Add_Target_Blank && !$IS_Target_Blank_Exist)
{
$New_Anchor_Tag = str_replace(">",' target="_blank">',$New_Anchor_Tag);
}
$Content = str_replace($Anchor_Tag,$New_Anchor_Tag,$Content);
}
}
return $Content;
}
To use it:
$Page_Content = '<a href="/internal-page">internal</a>
<a href="http://yoursite.com/page">internal</a>
<a href="http://google.com">google</a>
<a href="http://example.com">example</a>
<a href="https://stackoverflow.com">stackoverflow</a>';
$Whitelist = ["http://yoursite.com","http://localhost"];
echo Add_Nofollow_Attr($Page_Content,$Whitelist,true);
WordPress solution:
function replace__method($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/articles"; // re-add quirky config here
$blog_url = 'https://'.$_SERVER['SERVER_NAME'];
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
add_filter( 'the_content', 'add_nofollow_to_external_links', 1 );
function add_nofollow_to_external_links( $content ) {
$content = preg_replace_callback('~<(a\s[^>]+)>~isU', "replace__method", $content);
return $content;
}
A good script which adds nofollow automatically and keeps the other attributes:
function nofollow(string $html, string $baseUrl = null) {
return preg_replace_callback(
'#<a([^>]*)>(.+)</a>#isU', function ($mach) use ($baseUrl) {
list ($a, $attr, $text) = $mach;
if (preg_match('#href=["\']([^"\']*)["\']#', $attr, $url)) {
$url = $url[1];
if (is_null($baseUrl) || !str_starts_with($url, $baseUrl)) {
if (preg_match('#rel=["\']([^"\']*)["\']#', $attr, $rel)) {
$relAttr = $rel[0];
$rel = $rel[1];
}
$rel = 'rel="' . ($rel ? (strpos($rel, 'nofollow') ? $rel : $rel . ' nofollow') : 'nofollow') . '"';
$attr = isset($relAttr) ? str_replace($relAttr, $rel, $attr) : $attr . ' ' . $rel;
$a = '<a ' . $attr . '>' . $text . '</a>';
}
}
return $a;
},
$html
);
}
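A quick usage sketch (the URLs are placeholders; PHP 8 is assumed because the function uses str_starts_with()):
// The first link matches the base URL and is left alone; the second gets rel="nofollow".
echo nofollow(
    '<a href="https://mysite.example/post">in</a> <a href="https://other.example">out</a>',
    'https://mysite.example'
);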
The XML is like this (WordPress URLs). I want to strip them and get only the posts' words.
http://www.site1.com/dir/this-is-page/
http://www.site2.com/this-is-page
How do I strip the URLs and get only "this is page" (without the rest of the URL, and without the "-") if I have two different types of URLs, one with a dir and one without? Sample code below:
$feeds = array('http://www.site1.com/dir/feed.xml', 'http://www.site2.com/feed.xml');
foreach($feeds as $feed)
{
$xml = simplexml_load_file($feed);
foreach( $xml->url as $url )
{
$loc = $url->loc;
echo $loc;
$locstrip = explode("/",$loc);
$locstripped = $locstrip[4];
echo '<br />';
echo $locstripped;
echo '<br />';
mysql_query("TRUNCATE TABLE interlinks");
mysql_query("INSERT INTO interlinks (title, url) VALUES ('$locstripped', '$loc')");
}
}
?>
TY
Thanks guys, I did it like this:
$urlstrip = basename($loc);
$linestrip = str_replace(array('-','_'), ' ', $urlstrip);
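For both sample locations that boils down to:
$loc = 'http://www.site1.com/dir/this-is-page/';
// basename() ignores the trailing slash, so the /dir/ and no-dir forms both work.
echo str_replace(array('-', '_'), ' ', basename($loc)); // this is page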
You want only the last segment of the URL?
Try something like this.
$url = trim('http://www.site1.com/dir/this-is-page/', '/');
$url = explode('/', $url);
$url = array_pop($url);
$url = str_replace(array('-','_'), ' ', $url);
It's not very elegant... but it works.
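Echoing the result confirms it:
echo $url; // this is page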
replace
$locstripped = $locstrip[4];
with
$locstripped = $locstrip[count($locstrip) - 1];
if(!$locstripped)
$locstripped = $locstrip[count($locstrip) - 2];
$locstripped = str_replace('-', ' ', $locstripped);
If I have a string that contains a URL (for example's sake, we'll call it $url) such as:
$url = "Here is a funny site http://www.tunyurl.com/34934";
How do i remove the URL from the string?
The difficulty is that URLs might also show up without the http://, such as:
$url = "Here is another funny site www.tinyurl.com/55555";
There is no HTML present. How would I start a search for http or www, then remove the text/numbers/symbols up to the first space?
I re-read the question, here is a function that would work as intended:
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,'http') || (count(explode('.',$u)) > 1)) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
return implode(' ',$U);
}
$url = "Here is another funny site www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);
Edit #2/#3 (I must be bored). Here is a version that verifies there is a TLD within the URL:
function containsTLD($string) {
preg_match(
"/(AC($|\/)|\.AD($|\/)|\.AE($|\/)|\.AERO($|\/)|\.AF($|\/)|\.AG($|\/)|\.AI($|\/)|\.AL($|\/)|\.AM($|\/)|\.AN($|\/)|\.AO($|\/)|\.AQ($|\/)|\.AR($|\/)|\.ARPA($|\/)|\.AS($|\/)|\.ASIA($|\/)|\.AT($|\/)|\.AU($|\/)|\.AW($|\/)|\.AX($|\/)|\.AZ($|\/)|\.BA($|\/)|\.BB($|\/)|\.BD($|\/)|\.BE($|\/)|\.BF($|\/)|\.BG($|\/)|\.BH($|\/)|\.BI($|\/)|\.BIZ($|\/)|\.BJ($|\/)|\.BM($|\/)|\.BN($|\/)|\.BO($|\/)|\.BR($|\/)|\.BS($|\/)|\.BT($|\/)|\.BV($|\/)|\.BW($|\/)|\.BY($|\/)|\.BZ($|\/)|\.CA($|\/)|\.CAT($|\/)|\.CC($|\/)|\.CD($|\/)|\.CF($|\/)|\.CG($|\/)|\.CH($|\/)|\.CI($|\/)|\.CK($|\/)|\.CL($|\/)|\.CM($|\/)|\.CN($|\/)|\.CO($|\/)|\.COM($|\/)|\.COOP($|\/)|\.CR($|\/)|\.CU($|\/)|\.CV($|\/)|\.CX($|\/)|\.CY($|\/)|\.CZ($|\/)|\.DE($|\/)|\.DJ($|\/)|\.DK($|\/)|\.DM($|\/)|\.DO($|\/)|\.DZ($|\/)|\.EC($|\/)|\.EDU($|\/)|\.EE($|\/)|\.EG($|\/)|\.ER($|\/)|\.ES($|\/)|\.ET($|\/)|\.EU($|\/)|\.FI($|\/)|\.FJ($|\/)|\.FK($|\/)|\.FM($|\/)|\.FO($|\/)|\.FR($|\/)|\.GA($|\/)|\.GB($|\/)|\.GD($|\/)|\.GE($|\/)|\.GF($|\/)|\.GG($|\/)|\.GH($|\/)|\.GI($|\/)|\.GL($|\/)|\.GM($|\/)|\.GN($|\/)|\.GOV($|\/)|\.GP($|\/)|\.GQ($|\/)|\.GR($|\/)|\.GS($|\/)|\.GT($|\/)|\.GU($|\/)|\.GW($|\/)|\.GY($|\/)|\.HK($|\/)|\.HM($|\/)|\.HN($|\/)|\.HR($|\/)|\.HT($|\/)|\.HU($|\/)|\.ID($|\/)|\.IE($|\/)|\.IL($|\/)|\.IM($|\/)|\.IN($|\/)|\.INFO($|\/)|\.INT($|\/)|\.IO($|\/)|\.IQ($|\/)|\.IR($|\/)|\.IS($|\/)|\.IT($|\/)|\.JE($|\/)|\.JM($|\/)|\.JO($|\/)|\.JOBS($|\/)|\.JP($|\/)|\.KE($|\/)|\.KG($|\/)|\.KH($|\/)|\.KI($|\/)|\.KM($|\/)|\.KN($|\/)|\.KP($|\/)|\.KR($|\/)|\.KW($|\/)|\.KY($|\/)|\.KZ($|\/)|\.LA($|\/)|\.LB($|\/)|\.LC($|\/)|\.LI($|\/)|\.LK($|\/)|\.LR($|\/)|\.LS($|\/)|\.LT($|\/)|\.LU($|\/)|\.LV($|\/)|\.LY($|\/)|\.MA($|\/)|\.MC($|\/)|\.MD($|\/)|\.ME($|\/)|\.MG($|\/)|\.MH($|\/)|\.MIL($|\/)|\.MK($|\/)|\.ML($|\/)|\.MM($|\/)|\.MN($|\/)|\.MO($|\/)|\.MOBI($|\/)|\.MP($|\/)|\.MQ($|\/)|\.MR($|\/)|\.MS($|\/)|\.MT($|\/)|\.MU($|\/)|\.MUSEUM($|\/)|\.MV($|\/)|\.MW($|\/)|\.MX($|\/)|\.MY($|\/)|\.MZ($|\/)|\.NA($|\/)|\.NAME($|\/)|\.NC($|\/)|\.NE($|\/)|\.NET($|\/)|\.NF($|\/)|\.NG($|\/)|\.NI($|\/)|\.NL($|\/)|\.NO($|\/)|\.NP($|\/)|\.NR($|\/)|\.NU($|\/)|\.NZ($|\/)|\.OM($|\/)|\.ORG($|\/)|\.PA($|\/)|\.PE($|\/)|\.PF($|\/)|\.PG($|\/)|\.PH($|\/)|\.PK($|\/)|\.PL($|\/)|\.PM($|\/)|\.PN($|\/)|\.PR($|\/)|\.PRO($|\/)|\.PS($|\/)|\.PT($|\/)|\.PW($|\/)|\.PY($|\/)|\.QA($|\/)|\.RE($|\/)|\.RO($|\/)|\.RS($|\/)|\.RU($|\/)|\.RW($|\/)|\.SA($|\/)|\.SB($|\/)|\.SC($|\/)|\.SD($|\/)|\.SE($|\/)|\.SG($|\/)|\.SH($|\/)|\.SI($|\/)|\.SJ($|\/)|\.SK($|\/)|\.SL($|\/)|\.SM($|\/)|\.SN($|\/)|\.SO($|\/)|\.SR($|\/)|\.ST($|\/)|\.SU($|\/)|\.SV($|\/)|\.SY($|\/)|\.SZ($|\/)|\.TC($|\/)|\.TD($|\/)|\.TEL($|\/)|\.TF($|\/)|\.TG($|\/)|\.TH($|\/)|\.TJ($|\/)|\.TK($|\/)|\.TL($|\/)|\.TM($|\/)|\.TN($|\/)|\.TO($|\/)|\.TP($|\/)|\.TR($|\/)|\.TRAVEL($|\/)|\.TT($|\/)|\.TV($|\/)|\.TW($|\/)|\.TZ($|\/)|\.UA($|\/)|\.UG($|\/)|\.UK($|\/)|\.US($|\/)|\.UY($|\/)|\.UZ($|\/)|\.VA($|\/)|\.VC($|\/)|\.VE($|\/)|\.VG($|\/)|\.VI($|\/)|\.VN($|\/)|\.VU($|\/)|\.WF($|\/)|\.WS($|\/)|\.XN--0ZWM56D($|\/)|\.XN--11B5BS3A9AJ6G($|\/)|\.XN--80AKHBYKNJ4F($|\/)|\.XN--9T4B11YI5A($|\/)|\.XN--DEBA0AD($|\/)|\.XN--G6W251D($|\/)|\.XN--HGBK6AJ7F53BBA($|\/)|\.XN--HLCJ6AYA9ESC7A($|\/)|\.XN--JXALPDLP($|\/)|\.XN--KGBECHTV($|\/)|\.XN--ZCKZAH($|\/)|\.YE($|\/)|\.YT($|\/)|\.YU($|\/)|\.ZA($|\/)|\.ZM($|\/)|\.ZW)/i",
$string,
$M);
$has_tld = (count($M) > 0) ? true : false;
return $has_tld;
}
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,".")) { //only preg_match if there is a dot
if (containsTLD($u) === true) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
}
return implode(' ',$U);
}
$url = "Here is another funny site badurl.badone somesite.ca/worse.jpg but this badsite.com www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);
returns:
Cleaned: Here is another funny site badurl.badone but this and and
$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $string);
Parsing text for URLs is hard and looking for pre-existing, heavily tested code that already does this for you would be better than writing your own code and missing edge cases. For example, I would take a look at the process in Django's urlize, which wraps URLs in anchors. You could port it over to PHP, and--instead of wrapping URLs in an anchor--just delete them from the text.
Thanks Mike. I updated it a bit, since it returned a notice error:
'/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i'
$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $string);
$url = "Here is a funny site http://www.tunyurl.com/34934";
$replace = 'http www .com .org .net';
$with = '';
$clean_url = clean($url,$replace,$with);
echo $clean_url;
function clean($url,$replace,$with) {
$replace = explode(" ",$replace);
$new_string = '';
$check = explode(" ",$url);
foreach($check AS $key => $value) {
foreach($replace AS $key2 => $value2 ) {
if (strpos(strtolower($value), strtolower($value2)) !== false) {
$value = $with;
break;
}
}
$new_string .= " ".$value;
}
return $new_string;
}
You would need to write a regular expression to extract out the urls.
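A minimal sketch of such a pattern, assuming the URLs are whitespace-delimited and start with either a scheme or a bare www.:
$text  = 'Here is a funny site http://www.tinyurl.com/34934 and another www.tinyurl.com/55555';
// Remove anything that starts with http(s):// or www. up to the next whitespace.
$clean = preg_replace('~(?:https?://|www\.)\S+~i', '', $text);
echo trim(preg_replace('~\s+~', ' ', $clean)); // Here is a funny site and another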