Using regex function in a while loop - php

I have a function that gets a specific link from a specific website, and it works on its own. The problem starts when I call it in a while loop: the returned links get longer on every iteration for some reason.
function getLinks($link) {
$link1 = $link;
$content = file_get_contents($link1);
$content = str_replace("<", "", $content);
$content = str_replace(">", "", $content);
preg_match("~previous page.+?next page~i", $content, $match);
preg_match("~\"(/.+?)\"~i", $match[0], $match);
$link2 = "https://en.wiktionary.org".$match[1];
echo $link1."<br>";
echo $link2."<br>";
return $link2;
}
$firstLink = getLinks("https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=AUTOPILOT%0Aautopilot#mw-pages");
Result firstLink = getLinks():
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=AUTOPILOT%0Aautopilot#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=BAGSIE%0Abagsie#mw-pages
^--- See how it works fine when it's like this? Then when I put it in a while loop:
$count = 0;
while ($count < 5) {
$count++;
$firstLink = getLinks($firstLink);
}
The results come out completely wrong, and the links start to stack up on each other, like so:
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=AUTOPILOT%0Aautopilot#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=BAGSIE%0Abagsie#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=BAGSIE%0Abagsie#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bpagefrom=BAGSIE%0Abagsie&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bpagefrom=BAGSIE%0Abagsie&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
https://en.wiktionary.org/w/index.php?title=Category:English_verbs&amp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpagefrom=BAGSIE%0Abagsie&amp%3Bamp%3Bamp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bamp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bamp%3Bpagefrom=ACETIFY%0Aacetify&amp%3Bpagefrom=ACETIFY%0Aacetify&pagefrom=ACETIFY%0Aacetify#mw-pages
This is driving me insane, so if anyone knows what I did wrong, please tell me. Thank you.
Regular function in while loop:
function addOne($num) {
echo $num."<br>";
$num++;
return $num;
}
$num = 0;
$count = 0;
while ($count < 5) {
$count++;
$num = addOne($num);
}
^---Works just fine

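Your problem is with HTML entities: the href extracted from the page source contains &amp; instead of a plain &, and that undecoded URL is fed back into the next request. I've rewritten the function to decode the link with html_entity_decode(), to skip repeated URLs, and to make it more efficient. You call it with a depth parameter, which in your case would be your while loop's maximum.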
function getLinks($linkd, $depth, $checked=array()) {
if(!is_array($linkd)) $linkd=array($linkd);
foreach($linkd as $link)
{
if(isset($checked[$link])) continue;
$link1 = $link;
$content = file_get_contents($link1);
$content = str_replace("<", "", $content);
$content = str_replace(">", "", $content);
preg_match("~previous page.+?next page~i", $content, $match);
preg_match("~\"(/.+?)\"~i", $match[0], $match);
$link2 = "https://en.wiktionary.org".$match[1];
echo $link1."<br>";
echo $link2."<br>";
$checked[$link] = true;
if($depth>0)
{
$depth--;
return getLinks(html_entity_decode($link2), $depth, $checked);
}
else
{
return $link2;
}
}
}
$firstLink = "https://en.wiktionary.org/w/index.php?title=Category:English_verbs&pagefrom=AUTOPILOT%0Aautopilot#mw-pages";
$firstLink = getLinks($firstLink, 5);
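To see the entity problem in isolation, here is a minimal sketch. The $href value is an assumed example of what the regex captures from the raw page source, not output copied from the site.

```php
<?php
// Assumed example of an href as it appears in raw HTML source:
// the ampersand between query parameters is written as &amp;
$href = '/w/index.php?title=Category:English_verbs&amp;pagefrom=BAGSIE%0Abagsie#mw-pages';

// Decode HTML entities before using the value as a URL.
$decoded = html_entity_decode($href);

echo "https://en.wiktionary.org" . $decoded . "\n";
```

Without the decode step, the next request carries a parameter literally named "amp;pagefrom", which matches the &amp%3Bpagefrom pieces piling up in the question's output.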

Related

Alter all a href links in php

I'm currently working on something where I need to add UTM tags to all links, and I have two minor issues I can't figure out.
This is the code I am using. The first issue is that if a link already has a parameter like ?test=test, the code refuses to add the UTM tags.
The other issue is minor, and I'm not sure it makes sense to change: instead of me having to specify a URL, it would be neat if the code added UTM tags to ALL a hrefs by default, without knowing the domain name.
I hope someone can help me out and push me in the right direction.
$url_modifier_domain = preg_quote('add-link.com');
$html_text = preg_replace_callback(
'#((?:https?:)?//'.$url_modifier_domain.'(/[^\'"\#]*)?)(?=[\'"\#])#i',
function($matches){
$url_modifier = 'utm=some&medium=stuff';
if (!isset($matches[2])) return $matches[1]."/?$url_modifier";
$q = strpos($matches[2],'?');
if ($q===false) return $matches[1]."?$url_modifier";
if ($q==strlen($matches[2])-1) return $matches[1].$url_modifier;
return $matches[1]."&$url_modifier";
},
$html);
Once you have detected the URLs, you can use parse_url() and parse_str() to break each URL apart, add utm and medium, and rebuild it without caring too much about the contents of the GET parameters or the hash:
$url_modifier_domain = preg_quote('add-link.com');
$html_text = preg_replace_callback(
'#((?:https?:)?//'.$url_modifier_domain.'(/[^\'"\#]*)?)(?=[\'"\#])#i',
function ($matches) {
$link = $matches[0];
if (strpos($link, '#') !== false) {
list($link, $hash) = explode('#', $link);
}
$res = parse_url($link);
$result = '';
if (isset($res['scheme'])) {
$result .= $res['scheme'].'://';
}
if (isset($res['host'])) {
$result .= $res['host'];
}
if (isset($res['path'])) {
$result .= $res['path'];
}
if (isset($res['query'])) {
parse_str($res['query'], $res['query']);
} else {
$res['query'] = [];
}
$res['query']['utm'] = 'some';
$res['query']['medium'] = 'stuff';
if (count($res['query']) > 0) {
$result .= '?'.http_build_query($res['query']);
}
if (isset($hash)) {
$result .= '#'.$hash;
}
return $result;
},
$html
);
As you can see, the code is longer but simpler
Edit
I made some changes so that it searches for every href="xxx" inside the text. If the link is not from add-link.com the script skips it; otherwise it tries to rebuild it as cleanly as possible.
$html = 'blabla a
a
a
a
a
a
a
a
a
a
a
';
$url_modifier_domain = preg_quote('add-link.com');
$html_text = preg_replace_callback(
'/href="([^"]+)"/i',
function ($matches) {
$link = $matches[1];
// ignoring outer links
if(strpos($link,'add-link.com') === false) return 'href="'.$link.'"';
if (strpos($link, '#') !== false) {
list($link, $hash) = explode('#', $link);
}
$res = parse_url($link);
$result = '';
if (isset($res['scheme'])) {
$result .= $res['scheme'].'://';
} else if(isset($res['host'])) {
$result .= '//';
}
if (isset($res['host'])) {
$result .= $res['host'];
}
if (isset($res['path'])) {
$result .= $res['path'];
} else {
$result .= '/';
}
if (isset($res['query'])) {
parse_str($res['query'], $res['query']);
} else {
$res['query'] = [];
}
$res['query']['utm'] = 'some';
$res['query']['medium'] = 'stuff';
if (count($res['query']) > 0) {
$result .= '?'.http_build_query($res['query']);
}
if (isset($hash)) {
$result .= '#'.$hash;
}
return 'href="'.$result.'"';
},
$html
);
var_dump($html_text);
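The query-merging step can also be pulled out into a small helper that works on one URL at a time, regardless of domain. This is only a sketch: addQueryParams is a hypothetical name, and it assumes the default "&" output separator for http_build_query().

```php
<?php
// Hypothetical helper: merges extra query parameters into a single URL,
// keeping any existing parameters and the #fragment.
function addQueryParams(string $url, array $extra): string
{
    // Split off the fragment, if any.
    $fragment = '';
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $fragment = substr($url, $hashPos);
        $url = substr($url, 0, $hashPos);
    }

    // Split off and parse the existing query string, if any.
    $query = [];
    $qPos = strpos($url, '?');
    if ($qPos !== false) {
        parse_str(substr($url, $qPos + 1), $query);
        $url = substr($url, 0, $qPos);
    }

    // New values overwrite existing keys with the same name.
    $query = array_merge($query, $extra);

    return $url . '?' . http_build_query($query) . $fragment;
}

echo addQueryParams(
    'https://add-link.com/page?test=test#top',
    ['utm' => 'some', 'medium' => 'stuff']
), "\n";
```

Calling this from the href callback instead of the domain check would tag every link, which is what the second part of the question asks for.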

What is the simplest way to split this string using PHP?

I have the below string in PHP.
:guest!lbjpewueqi#AF8A326D.E0B4A40D.F85DC93A.IP
I need to create these variables from the string:
$nick = guest
$user = lbjpewueqi
$host = AF8A326D.E0B4A40D.F85DC93A.IP
What is the best function to use to do this?
Ideally I would like to create some sort of function so I can pass to it the string and what part I want returned.
For example:
$string = ":guest!lbjpewueqi#AF8A326D.E0B4A40D.F85DC93A.IP";
echo stringToPart($string, 'nick');
guest
echo stringToPart($string, 'user');
lbjpewueqi
echo stringToPart($string, 'host');
AF8A326D.E0B4A40D.F85DC93A.IP
Another version:
function stringToPart($string, $part) {
if (preg_match('/^:(.*)!(.*)#(.*)/', $string, $matches)) {
$nick = $matches[1];
$user = $matches[2];
$host = $matches[3];
return isset($$part) ? $$part : null;
}
}
More strict than preg_split solutions - it checks separators order.
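Maybe this code will be helpful for you:
$p = '/[:!#]/';
$s = ":guest!lbjpewueqi#AF8A326D.E0B4A40D.F85DC93A.IP";
// print_r() with a true second argument returns the string instead of printing it, so it is omitted here
print_r( preg_split( $p, $s ) );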
You can declare a function like this:
$s = ":guest!lbjpewueqi#AF8A326D.E0B4A40D.F85DC93A.IP";
function stringToPart($str, $part) {
$pat['nick'] = '/:(.*)!/';
$pat['user'] = '/.*!(.*)#/';
$pat['host'] = '/#(.*)/';
preg_match($pat[$part], $str, $m);
if (count($m) > 1) return $m[1];
return null;
}
echo stringToPart($s,'nick')."\n";
echo stringToPart($s,'user')."\n";
echo stringToPart($s,'host')."\n";
The below should do what you're looking for.
$pattern = "/[:!#]/";
$subject = ":guest!lbjpewueqi#AF8A326D.E0B4A40D.F85DC93A.IP";
print_r(preg_split($pattern, $subject));
The pattern specifies which characters to split on, so you could in theory list any number of characters here if you needed to account for other separators in the strings being passed in.
To return the values instead of just printing them to the screen, use this:
$pattern = "/[:!#]/";
$subject = ":guest!lbjpewueqi#AF8A326D.E0B4A40D.F85DC93A.IP";
$result = preg_split($pattern, $subject);
$nick = $result[1];
$user = $result[2];
$host = $result[3];
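Because the string starts with a separator, the split above yields an empty first element, which is why the indexes start at 1. A small variation, sketched here, drops empty pieces with PREG_SPLIT_NO_EMPTY and unpacks the three parts directly. It assumes the input always has exactly the :nick!user#host shape.

```php
<?php
$subject = ':guest!lbjpewueqi#AF8A326D.E0B4A40D.F85DC93A.IP';

// -1 means "no limit"; PREG_SPLIT_NO_EMPTY discards the empty piece
// produced by the leading ':'.
[$nick, $user, $host] = preg_split('/[:!#]/', $subject, -1, PREG_SPLIT_NO_EMPTY);

echo $nick, "\n", $user, "\n", $host, "\n";
```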
echo stringToPart(':guest!lbjpewueqi#AF8A326D.E0B4A40D.F85DC93A.IP','nick');
function stringToPart($string, $type){
$result['nick']= substr($string,strpos($string,':')+1,(strpos($string,'!')-strpos($string,':')-1));
$result['user']= substr($string,strpos($string,'!')+1,(strpos($string,'#')-strpos($string,'!')-1));
$result['host']= substr($string,strpos($string,'#')+1);
return $result[$type];
}
<?php
function stringToPart($string, $key)
{
$matches = null;
$returnValue = preg_match('/:(?P<nick>[^!]*)!(?P<user>.*?)#(?P<host>.*)/', $string, $matches);
if (isset($matches[$key]))
{
return $matches[$key];
} else
{
return NULL;
}
}
$string = ':guest!lbjpewueqi#AF8A326D.E0B4A40D.F85DC93A.IP';
echo stringToPart($string, "nick");
echo "<br />";
echo stringToPart($string, "user");
echo "<br />";
echo stringToPart($string, "host");
echo "<br />";
?>

Regex not quite right

I have a site crawler which displays a list of URLs, but the problem is I cannot for the life of me get the last regex quite right.
All URLs end up listed as:
http://www.website.org/page1.html&--EFTTIUGJ4ITCyh0Frzb_LFXe_eHw
http://website.net/page2/&--EyqBLeFeCkSfmvA7p0cLrsy1Zm1g
http://foobar.website.com/page3.php&--E5WRBxuTOQikDIyBczaVXveOdRFg
The URLs can all be different, and the only thing which seems static is the & symbol.
How would I go about getting rid of the & symbol and everything to the right of it?
Here is what I have tried, with the above results:
function getresults($sterm) {
$html = file_get_html($sterm);
$result = "";
// find all h3 tags with class=r
foreach($html->find('h3[class="r"]') as $ef)
{
$result .= $ef->outertext . '<br>';
}
return $result;
}
function geturl($url) {
$var = $url;
$result = "";
preg_match_all ("/a[\s]+[^>]*?href[\s]?=[\s\"\/url?q=\']+".
"(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
$var, $matches);
$matches = $matches[1];
foreach($matches as $var)
{
$result .= $var."<br>";
}
echo preg_replace('/sa=U.*?usg=.*?AFQjCN/', "--" , $result);
}
If the URLs are ALWAYS in the same format, use explode:
<?php
$tmp = explode("&", "http://foobar.website.com/page3.php&--E5WRBxuTOQikDIyBczaVXveOdRFg");
?>
$tmp[0] should contain "http://foobar.website.com/page3.php" and
$tmp[1] should contain "--E5WRBxuTOQikDIyBczaVXveOdRFg"
A simple way to remove everything after the & character:
$result = substr($result, 0, strpos($result, '&'));
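One caveat with the substr/strpos line: when the string contains no & at all, strpos() returns false and the result is an empty string. A sketch that leaves such URLs untouched, using a hypothetical helper name stripAfterAmpersand and working on one URL at a time:

```php
<?php
// Hypothetical helper: returns the part of $url before the first '&',
// or the whole URL when there is no '&'.
function stripAfterAmpersand(string $url): string
{
    // With the third argument set to true, strstr() returns the text
    // before the needle, or false when the needle is absent.
    $head = strstr($url, '&', true);
    return $head === false ? $url : $head;
}

echo stripAfterAmpersand('http://foobar.website.com/page3.php&--E5WRBxuTOQikDIyBczaVXveOdRFg'), "\n";
echo stripAfterAmpersand('http://website.net/page2/'), "\n";
```

In geturl() this would be applied to each $var inside the foreach, before the "<br>" is appended.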

php function to extract link from string

I want to extract href links from a text or string. I wrote a little function to do that, but it is slow when the string to transform is large. My code is:
function spy_linkIntoString_Format($text) {
global $inc_lang; $lang = $inc_lang['tlang_media'];
$it = explode(' ' ,$text);
$result = '';
foreach($it as $jt) {
$a = trim($jt);
if(preg_match('/((?:[\w\d]+\:\/\/)?(?:[\w\-\d]+\.)+[\w\-\d]+(?:\/[\w\-\d]+)*(?:\/|\.[\w\-\d]+)?(?:\?[\w\-\d]+\=[\w\-\d]+\&?)?(?:\#[\w\-\d]*)?)/', $jt)) {
$pros_lis = str_replace('www.','',$jt);
$pros_lis = (strpos($pros_lis, 'http://') === false ? 'http://'. $pros_lis : $pros_lis);
$urlregx = parse_url($pros_lis);
$host_name = (!empty($urlregx['host']) ? $urlregx['host'] : '.com');
if($host_name == 'youtube.com') {
$string_v = $urlregx['query']; parse_str($string_v, $outs); $stID = $outs['v'];
$result .= '<a title="Youtube video" coplay="'.$stID.'" cotype="1" class="media_spy_vr5" href="#"><span class="link_media"></span>'.$lang['vtype_youtube'].'</a> ';
} elseif($host_name == 'vimeo.com') {
$path_s = $urlregx['path']; $patplode = explode("/", $path_s); $stID = $patplode[1];
$result .= '<a title="Vimeo video" coplay="'.$stID.'" cotype="2" class="media_spy_vr5" href="#"><span class="link_media"></span>'.$lang['vtype_vimeo'].'</a> ';
} elseif($host_name == 'travspy.com') {
$result .= '</span>'.$pros_lis.' ';
} else {
$result .= '<span class="jkt_445 c8_big_corner"></span>'.$pros_lis.' ';
}
} else {
$result .= $jt.' ';
}
}
return trim($result);/**/
}
Can I make this run faster?
You should rewrite this to use preg_match_all instead of splitting the text into words (i.e. drop the explode).
$regex = '/\b((?:[\w\d]+\:\/\/)?(?:[\w\-\d]+\.)+[\w\-\d]+(?:\/[\w\-\d]+)*(?:\/|\.[\w\-\d]+)?(?:\?[\w\-\d]+\=[\w\-\d]+\&?)?(?:\#[\w\-\d]*)?)\b/';
preg_match_all($regex, $text, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$url = $match[0];
// your link generator
}
You seem to be breaking the text into words separated by spaces, and then matching each word against a regular expression. This is very time-consuming indeed.
A faster way to do this is to perform the regular expression search on the entire text and then iterate over its results.
preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
foreach($result[0] as $jt){
//do what you normally do with $jt
}
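If the goal is to rewrite the links in place while keeping the surrounding text, preg_replace_callback() does the search and the substitution in one pass. The sketch below uses a deliberately simplified URL pattern and a hypothetical linkify() function; both are assumptions for illustration, not the asker's pattern or markup.

```php
<?php
// Simplified, assumed URL pattern: optional scheme, dotted host, optional path.
$pattern = '~\b(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?~i';

// Hypothetical function: wraps every match in an anchor tag.
function linkify(string $text, string $pattern): string
{
    return preg_replace_callback($pattern, function (array $m): string {
        $url = $m[0];
        // Add a scheme when the matched text has none.
        $href = preg_match('~^https?://~i', $url) ? $url : 'http://' . $url;
        return '<a href="' . htmlspecialchars($href) . '">' . htmlspecialchars($url) . '</a>';
    }, $text);
}

echo linkify('See www.example.com/page and plain words.', $pattern), "\n";
```

The per-host branches (YouTube, Vimeo and so on) would go inside the callback, where parse_url() can be applied to $href.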

PHP Remove URL from string

If I have a string that contains a url (for examples sake, we'll call it $url) such as;
$url = "Here is a funny site http://www.tunyurl.com/34934";
How do I remove the URL from the string?
The difficulty is that URLs might also show up without the http://, such as:
$url = "Here is another funny site www.tinyurl.com/55555";
There is no HTML present. How would I start a search if http or www exists, and then remove the text/numbers/symbols up to the first space?
I re-read the question, here is a function that would work as intended:
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,'http') || (count(explode('.',$u)) > 1)) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
return implode(' ',$U);
}
$url = "Here is another funny site www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);
Edit #2/#3 (I must be bored). Here is a version that verifies there is a TLD within the URL:
function containsTLD($string) {
// TLD list as given in this answer; it is not exhaustive.
// A TLD must follow a dot and be followed by the end of the string or a slash.
return preg_match(
'/\.(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN--0ZWM56D|XN--11B5BS3A9AJ6G|XN--80AKHBYKNJ4F|XN--9T4B11YI5A|XN--DEBA0AD|XN--G6W251D|XN--HGBK6AJ7F53BBA|XN--HLCJ6AYA9ESC7A|XN--JXALPDLP|XN--KGBECHTV|XN--ZCKZAH|YE|YT|YU|ZA|ZM|ZW)(?:$|\/)/i',
$string) === 1;
}
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,".")) { //only preg_match if there is a dot
if (containsTLD($u) === true) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
}
return implode(' ',$U);
}
$url = "Here is another funny site badurl.badone somesite.ca/worse.jpg but this badsite.com www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);
returns:
Cleaned: Here is another funny site badurl.badone but this and and
$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|$!:,.;]*[A-Z0-9+&##\/%=~_|$]/i', '', $string);
Parsing text for URLs is hard and looking for pre-existing, heavily tested code that already does this for you would be better than writing your own code and missing edge cases. For example, I would take a look at the process in Django's urlize, which wraps URLs in anchors. You could port it over to PHP, and--instead of wrapping URLs in an anchor--just delete them from the text.
Thanks Mike.
I updated it a bit because it returned a notice error:
'/\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|$!:,.;]*[A-Z0-9+&##\/%=~_|$]/i'
$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|$!:,.;]*[A-Z0-9+&##\/%=~_|$]/i', '', $string);
$url = "Here is a funny site http://www.tunyurl.com/34934";
$replace = 'http www .com .org .net';
$with = '';
$clean_url = clean($url,$replace,$with);
echo $clean_url;
function clean($url,$replace,$with) {
$replace = explode(" ",$replace);
$new_string = '';
$check = explode(" ",$url);
foreach($check AS $key => $value) {
foreach($replace AS $key2 => $value2 ) {
if (stripos($value, $value2) !== false) {
$value = $with;
break;
}
}
$new_string .= " ".$value;
}
return $new_string;
}
You would need to write a regular expression to extract out the urls.
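As a starting point, the rule described in the question (find http or www, then remove everything up to the next space) can be sketched with a single preg_replace(). The function name removeUrls is hypothetical, and the pattern deliberately ignores URLs that have neither a scheme nor a www. prefix.

```php
<?php
// Hypothetical function: removes tokens starting with http://, https://
// or www. up to the next whitespace, then tidies the leftover spacing.
function removeUrls(string $text): string
{
    $stripped = preg_replace('~\b(?:https?://|www\.)\S+~i', '', $text);
    // Collapse runs of whitespace left behind by the removal.
    return trim(preg_replace('/\s{2,}/', ' ', $stripped));
}

echo removeUrls('Here is a funny site http://www.tunyurl.com/34934'), "\n";
echo removeUrls('Here is another funny site www.tinyurl.com/55555'), "\n";
```

Bare hosts such as img.hostingsite.com/badpic.jpg are not caught by this pattern; handling them needs a TLD check like the one in the longer answer above.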
