PHP Remove URL from string - php

If I have a string that contains a url (for examples sake, we'll call it $url) such as;
$url = "Here is a funny site http://www.tunyurl.com/34934";
How do i remove the URL from the string?
Difficulty is, urls might also show up without the http://, such as ;
$url = "Here is another funny site www.tinyurl.com/55555";
There is no HTML present. How would i start a search if http or www exists, then remove the text/numbers/symbols until the first space?

I re-read the question, here is a function that would work as intended:
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,'http') || (count(explode('.',$u)) > 1)) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
return implode(' ',$U);
}
$url = "Here is another funny site www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);
Edit #2/#3 (I must be bored). Here is a version that verifies there is a TLD within the URL:
function containsTLD($string) {
preg_match(
"/(AC($|\/)|\.AD($|\/)|\.AE($|\/)|\.AERO($|\/)|\.AF($|\/)|\.AG($|\/)|\.AI($|\/)|\.AL($|\/)|\.AM($|\/)|\.AN($|\/)|\.AO($|\/)|\.AQ($|\/)|\.AR($|\/)|\.ARPA($|\/)|\.AS($|\/)|\.ASIA($|\/)|\.AT($|\/)|\.AU($|\/)|\.AW($|\/)|\.AX($|\/)|\.AZ($|\/)|\.BA($|\/)|\.BB($|\/)|\.BD($|\/)|\.BE($|\/)|\.BF($|\/)|\.BG($|\/)|\.BH($|\/)|\.BI($|\/)|\.BIZ($|\/)|\.BJ($|\/)|\.BM($|\/)|\.BN($|\/)|\.BO($|\/)|\.BR($|\/)|\.BS($|\/)|\.BT($|\/)|\.BV($|\/)|\.BW($|\/)|\.BY($|\/)|\.BZ($|\/)|\.CA($|\/)|\.CAT($|\/)|\.CC($|\/)|\.CD($|\/)|\.CF($|\/)|\.CG($|\/)|\.CH($|\/)|\.CI($|\/)|\.CK($|\/)|\.CL($|\/)|\.CM($|\/)|\.CN($|\/)|\.CO($|\/)|\.COM($|\/)|\.COOP($|\/)|\.CR($|\/)|\.CU($|\/)|\.CV($|\/)|\.CX($|\/)|\.CY($|\/)|\.CZ($|\/)|\.DE($|\/)|\.DJ($|\/)|\.DK($|\/)|\.DM($|\/)|\.DO($|\/)|\.DZ($|\/)|\.EC($|\/)|\.EDU($|\/)|\.EE($|\/)|\.EG($|\/)|\.ER($|\/)|\.ES($|\/)|\.ET($|\/)|\.EU($|\/)|\.FI($|\/)|\.FJ($|\/)|\.FK($|\/)|\.FM($|\/)|\.FO($|\/)|\.FR($|\/)|\.GA($|\/)|\.GB($|\/)|\.GD($|\/)|\.GE($|\/)|\.GF($|\/)|\.GG($|\/)|\.GH($|\/)|\.GI($|\/)|\.GL($|\/)|\.GM($|\/)|\.GN($|\/)|\.GOV($|\/)|\.GP($|\/)|\.GQ($|\/)|\.GR($|\/)|\.GS($|\/)|\.GT($|\/)|\.GU($|\/)|\.GW($|\/)|\.GY($|\/)|\.HK($|\/)|\.HM($|\/)|\.HN($|\/)|\.HR($|\/)|\.HT($|\/)|\.HU($|\/)|\.ID($|\/)|\.IE($|\/)|\.IL($|\/)|\.IM($|\/)|\.IN($|\/)|\.INFO($|\/)|\.INT($|\/)|\.IO($|\/)|\.IQ($|\/)|\.IR($|\/)|\.IS($|\/)|\.IT($|\/)|\.JE($|\/)|\.JM($|\/)|\.JO($|\/)|\.JOBS($|\/)|\.JP($|\/)|\.KE($|\/)|\.KG($|\/)|\.KH($|\/)|\.KI($|\/)|\.KM($|\/)|\.KN($|\/)|\.KP($|\/)|\.KR($|\/)|\.KW($|\/)|\.KY($|\/)|\.KZ($|\/)|\.LA($|\/)|\.LB($|\/)|\.LC($|\/)|\.LI($|\/)|\.LK($|\/)|\.LR($|\/)|\.LS($|\/)|\.LT($|\/)|\.LU($|\/)|\.LV($|\/)|\.LY($|\/)|\.MA($|\/)|\.MC($|\/)|\.MD($|\/)|\.ME($|\/)|\.MG($|\/)|\.MH($|\/)|\.MIL($|\/)|\.MK($|\/)|\.ML($|\/)|\.MM($|\/)|\.MN($|\/)|\.MO($|\/)|\.MOBI($|\/)|\.MP($|\/)|\.MQ($|\/)|\.MR($|\/)|\.MS($|\/)|\.MT($|\/)|\.MU($|\/)|\.MUSEUM($|\/)|\.MV($|\/)|\.MW($|\/)|\.MX($|\/)|\.MY($|\/)|\.MZ($|\/)|\.NA($|\/)|\.NAME($|\/)|\.NC($|\/)|\.NE($|\/)|\.NET($|\/)|\.NF($|\/)|\.NG($|\/)|\.NI($|\/)|\.NL($|\/)|\.NO($|\/)|\.NP($|\/)|\.NR($|\/)|\.NU($|\/)|\.NZ($|\/)|\.OM($|\/)|\.ORG($|\/)|\.PA($|\/)|\.PE($|\/)|\.PF($|\/)|\.PG($|\/)|\.PH($|\/)|\.PK($|\/)|\.PL($|\/)|\.PM($|\/)|\.PN($|\/)|\.PR($|\/)|\.PRO($|\/)|\.PS($|\/)|\.PT($|\/)|\.PW($|\/)|\.PY($|\/)|\.QA($|\/)|\.RE($|\/)|\.RO($|\/)|\.RS($|\/)|\.RU($|\/)|\.RW($|\/)|\.SA($|\/)|\.SB($|\/)|\.SC($|\/)|\.SD($|\/)|\.SE($|\/)|\.SG($|\/)|\.SH($|\/)|\.SI($|\/)|\.SJ($|\/)|\.SK($|\/)|\.SL($|\/)|\.SM($|\/)|\.SN($|\/)|\.SO($|\/)|\.SR($|\/)|\.ST($|\/)|\.SU($|\/)|\.SV($|\/)|\.SY($|\/)|\.SZ($|\/)|\.TC($|\/)|\.TD($|\/)|\.TEL($|\/)|\.TF($|\/)|\.TG($|\/)|\.TH($|\/)|\.TJ($|\/)|\.TK($|\/)|\.TL($|\/)|\.TM($|\/)|\.TN($|\/)|\.TO($|\/)|\.TP($|\/)|\.TR($|\/)|\.TRAVEL($|\/)|\.TT($|\/)|\.TV($|\/)|\.TW($|\/)|\.TZ($|\/)|\.UA($|\/)|\.UG($|\/)|\.UK($|\/)|\.US($|\/)|\.UY($|\/)|\.UZ($|\/)|\.VA($|\/)|\.VC($|\/)|\.VE($|\/)|\.VG($|\/)|\.VI($|\/)|\.VN($|\/)|\.VU($|\/)|\.WF($|\/)|\.WS($|\/)|\.XN--0ZWM56D($|\/)|\.XN--11B5BS3A9AJ6G($|\/)|\.XN--80AKHBYKNJ4F($|\/)|\.XN--9T4B11YI5A($|\/)|\.XN--DEBA0AD($|\/)|\.XN--G6W251D($|\/)|\.XN--HGBK6AJ7F53BBA($|\/)|\.XN--HLCJ6AYA9ESC7A($|\/)|\.XN--JXALPDLP($|\/)|\.XN--KGBECHTV($|\/)|\.XN--ZCKZAH($|\/)|\.YE($|\/)|\.YT($|\/)|\.YU($|\/)|\.ZA($|\/)|\.ZM($|\/)|\.ZW)/i",
$string,
$M);
$has_tld = (count($M) > 0) ? true : false;
return $has_tld;
}
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,".")) { //only preg_match if there is a dot
if (containsTLD($u) === true) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
}
return implode(' ',$U);
}
$url = "Here is another funny site badurl.badone somesite.ca/worse.jpg but this badsite.com www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);
returns:
Cleaned: Here is another funny site badurl.badone but this and and

$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|$!:,.;]*[A-Z0-9+&##\/%=~_|$]/i', '', $string);

Parsing text for URLs is hard and looking for pre-existing, heavily tested code that already does this for you would be better than writing your own code and missing edge cases. For example, I would take a look at the process in Django's urlize, which wraps URLs in anchors. You could port it over to PHP, and--instead of wrapping URLs in an anchor--just delete them from the text.

thanks mike,
update a bit, it return notice error,
'/\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|$!:,.;]*[A-Z0-9+&##\/%=~_|$]/i'
$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|$!:,.;]*[A-Z0-9+&##\/%=~_|$]/i', '', $string);

$url = "Here is a funny site http://www.tunyurl.com/34934";
$replace = 'http www .com .org .net';
$with = '';
$clean_url = clean($url,$replace,$with);
echo $clean_url;
function clean($url,$replace,$with) {
$replace = explode(" ",$replace);
$new_string = '';
$check = explode(" ",$url);
foreach($check AS $key => $value) {
foreach($replace AS $key2 => $value2 ) {
if (-1 < strpos( strtolower($value), strtolower($value2) ) ) {
$value = $with;
break;
}
}
$new_string .= " ".$value;
}
return $new_string;
}

You would need to write a regular expression to extract out the urls.

Related

php shorten all urls with is.gd api in a string and linkify them

As the title says I'm trying to shorten all urls with is.gd api in a string and linkify them.
function link_isgd($text)
{
$regex = '#(https?://([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-~]*(\?\S+)?)?)?)#';
preg_match_all($regex, $text, $matches);
foreach($matches[0] as $longurl)
{
$tiny = file_get_contents('http://isnot.gd/api.php?longurl='.$longurl.'&format=json');
$json = json_decode($tiny, true);
foreach($json as $key => $value)
{
if ($key == 'errorcode')
{
$link = $longurl;
}
else if ($key == 'shorturl')
{
$link = $value;
}
}
}
return preg_replace($regex, ''.$link.'', $text);
}
$txt = 'Some text with links https://www.abcdefg.com/123 blah blah blah https://nooodle.com';
echo link_isgd($txt);
This is what I have got so far, linkifying works and so does shortening if only 1 url is in the string, however if theres 2 or more they all end up the same.
Note: the is.gd was not allowed in post, SO thought I was posting a short link which is not allowed here, so I had to change it to isnot.gd.
Your variable $link is not array so it takes only last assigned value of $link. You can replace preg_replace with str_replace and pass arrays with matches and links.
You can also use preg_replace_callback() and you can pass $matches directly to function which will replace with link. https://stackoverflow.com/a/9416265/7082164
function link_isgd($text)
{
$regex = '#(https?://([-\w\.]+[-\w])+(:\d+)?(/([\w/_\.#-~]*(\?\S+)?)?)?)#';
preg_match_all($regex, $text, $matches);
$links = [];
foreach ($matches[0] as $longurl) {
$tiny = file_get_contents('http://isnot.gd/api.php?longurl=' . $longurl . '&format=json');
$json = json_decode($tiny, true);
foreach ($json as $key => $value) {
if ($key == 'errorcode') {
$links[] = $longurl;
} else if ($key == 'shorturl') {
$links[] = $value;
}
}
}
$links = array_map(function ($el) {
return '' . $el . '';
}, $links);
return str_replace($matches[0], $links, $text);
}

Parse url with pattern in PHP?

How to determine, using regexp or something else in PHP, that following urls match some patterns with tokens (url => pattern):
node/11221 => node/%node
node/38429/news => node/%node/news
album/34234/shadowbox/321023 => album/%album/shadowbox/%photo
Thanks in advance!
Update 1
Wrote the following script:
<?php
$patterns = [
"node/%node",
"node/%node/news",
"album/%album/shadowbox/%photo",
"media/photo",
"blogs",
"news",
"node/%node/players",
];
$url = "node/11111/news";
foreach ($patterns as $pattern) {
$result_pattern = preg_replace("/\/%[^\/]+/x", '/*', $pattern);
$to_replace = ['/\\\\\*/']; // asterisks
$replacements = ['[^\/]+'];
$result_pattern = preg_quote($result_pattern, '/');
$result_pattern = '/^(' . preg_replace($to_replace, $replacements, $result_pattern) . ')$/';
if (preg_match($result_pattern, $url)) {
echo "<pre>" . $pattern . "</pre>";
}
}
?>
Could anyone analyze whether this code is good enough? And also explain why there is so many slashes in this part $to_replace = ['/\\\\\*/']; (regarding the replacement, found exactly such solution on the Internet).
If you know the format beforehand you can use preg_match. For example in the first example, you know %node can only be numbers. Matching multiples should be as as easy as we did it earlier, just store the regex in the array:
$patterns = array(
'node/%node' => '|node/[0-9]+$|',
'node/%node/news' => '|node/[0-9]+/news|',
'album/%album/shadowbox/%photo' => '|album/[0-9]+/shadowbox/[0-9]+|',
'media/photo' => '|media/photo|',
'blogs' => '|blogs|',
'news' => '|news|',
'node/%node/players' => '|node/[0-9]+/players|',
);
$url = "node/11111/players";
foreach ($patterns as $pattern => $regex) {
preg_match($regex, $url, $results);
if (!empty($results)) {
echo "<pre>" . $pattern . "</pre>";
}
}
Notice how I added the question mark $ to end of the first rule, this will insure that it doesn't break into the second rule.
Here is the generic solution to the solution above
<?php
// The url part
$url = "/node/123/hello/strText";
// The pattern part
$pattern = "/node/:id/hello/:test";
// Replace all variables with * using regex
$buffer = preg_replace("(:[a-z]+)", "*", $pattern);
// Explode to get strings at *
// In this case ['/node/','/hello/']
$buffer = explode("*", $buffer);
// Control variables for loop execution
$IS_MATCH = True;
$CAPTURE = [];
for ($i=0; $i < sizeof($buffer); $i++) {
$slug = $buffer[$i];
$real_slug = substr($url, 0 , strlen($slug));
if (!strcmp($slug, $real_slug)) {
$url = substr($url, strlen($slug));
$temp = explode("/", $url)[0];
$CAPTURE[sizeof($CAPTURE)+1] = $temp;
$url = substr($url,strlen($temp));
}else {
$IS_MATCH = False;
}
}
unset($CAPTURE[sizeof($CAPTURE)]);
if($IS_MATCH)
print_r($CAPTURE);
else
print "Not a match";
?>
You can pretty much convert the code above into a function and pass parameters to check against the array case. The first step is regex to convert all variables into * and the explode by *. Finally loop over this array and keep comparing to the url to see if the pattern matches using simple string comparison.
As long as the pattern is fixed, you can use preg_match() function:
$urls = array (
"node/11221",
"node/38429/news",
"album/34234/shadowbox/321023",
);
foreach ($urls as $url)
{
if (preg_match ("|node/([\d]+$)|", $url, $matches))
{
print "Node is {$matches[1]}\n";
}
elseif (preg_match ("|node/([\d]+)/news|", $url, $matches))
{
print "Node is {$matches[1]}\n";
}
elseif (preg_match ("|album/([\d]+)/shadowbox/([\d]+)$|", $url, $matches))
{
print "Album is {$matches[1]} and photo is {$matches[2]}\n";
}
}
For other patterns to match, adjust as necessary.

Regex not quite right

I have a site crawler which displays a list of urls, but the problem is I cannot for the life of me get the last regex quite right.
all urls end up listed as:
http://www.website.org/page1.html&--EFTTIUGJ4ITCyh0Frzb_LFXe_eHw
http://website.net/page2/&--EyqBLeFeCkSfmvA7p0cLrsy1Zm1g
http://foobar.website.com/page3.php&--E5WRBxuTOQikDIyBczaVXveOdRFg
The Urls can all be different and the only thing which seems static is the & symbol.
How would go abouts getting rid of the & symbol and everything beyond it to the right?
Here is what I have tried with the above results:
function getresults($sterm) {
$html = file_get_html($sterm);
$result = "";
// find all span tags with class=gb1
foreach($html->find('h3[class="r"]') as $ef)
{
$result .= $ef->outertext . '<br>';
}
return $result;
}
function geturl($url) {
$var = $url;
$result = "";
preg_match_all ("/a[\s]+[^>]*?href[\s]?=[\s\"\/url?q=\']+".
"(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
$var, $matches);
$matches = $matches[1];
foreach($matches as $var)
{
$result .= $var."<br>";
}
echo preg_replace('/sa=U.*?usg=.*?AFQjCN/', "--" , $result);
}
if url are ALWAYS in the same format, use explode :
<?php
$tmp = explode("&", "http://foobar.website.com/page3.php&--E5WRBxuTOQikDIyBczaVXveOdRFg");
?>
$tmp[0] should content "http://foobar.website.com/page3.php" and
$tmp[1] should content "--E5WRBxuTOQikDIyBczaVXveOdRFg"
A simple way to remove everything after the & character:
$result = substr($result, 0, strpos($result, '&'));

Extracting Twitter hashtag from string in PHP

I need some help with twitter hashtag, I need to extract a certain hashtag as string variable in PHP.
Until now I have this
$hash = preg_replace ("/#(\\w+)/", "#$1", $tweet_text);
but this just transforms hashtag_string into link
Use preg_match() to identify the hash and capture it to a variable, like so:
$string = 'Tweet #hashtag';
preg_match("/#(\\w+)/", $string, $matches);
$hash = $matches[1];
var_dump( $hash); // Outputs 'hashtag'
Demo
I think this function will help you:
echo get_hashtags($string);
function get_hashtags($string, $str = 1) {
preg_match_all('/#(\w+)/',$string,$matches);
$i = 0;
if ($str) {
foreach ($matches[1] as $match) {
$count = count($matches[1]);
$keywords .= "$match";
$i++;
if ($count > $i) $keywords .= ", ";
}
} else {
foreach ($matches[1] as $match) {
$keyword[] = $match;
}
$keywords = $keyword;
}
return $keywords;
}
As i understand you are saying that
in text/pargraph/post you want to show tag with hash sign(#) like this:- #tag
and in url you want to remove # sign because the string after # is not sended to server in request so i have edited your code and try out this:-
$string="www.funnenjoy.com is best #SocialNetworking #website";
$text=preg_replace('/#(\\w+)/','<a href=/hash/$1>$0</a>',$string);
echo $text; // output will be www.funnenjoy.com is best <a href=search/SocialNetworking>#SocialNetworking</a> <a href=/search/website>#website</a>
Extract multiple hashtag to array
$body = 'My #name is #Eminem, I am rap #god, #Yoyoya check it #out';
$hashtag_set = [];
$array = explode('#', $body);
foreach ($array as $key => $row) {
$hashtag = [];
if (!empty($row)) {
$hashtag = explode(' ', $row);
$hashtag_set[] = '#' . $hashtag[0];
}
}
print_r($hashtag_set);
You can use preg_match_all() PHP function
preg_match_all('/(?<!\w)#\w+/', $description, $allMatches);
will give you only hastag array
preg_match_all('/#(\w+)/', $description, $allMatches);
will give you hastag and without hastag array
print_r($allMatches)
You can extract a value in a string with preg_match function
preg_match("/#(\w+)/", $tweet_text, $matches);
$hash = $matches[1];
preg_match will store matching results in an array. You should take a look at the doc to see how to play with it.
Here's a non Regex way to do it:
<?php
$tweet = "Foo bar #hashTag hello world";
$hashPos = strpos($tweet,'#');
$hashTag = '';
while ($tweet[$hashPos] !== ' ') {
$hashTag .= $tweet[$hashPos++];
}
echo $hashTag;
Demo
Note: This will only pickup the first hashtag in the tweet.

remove a part of a URL argument string in php

I have a string in PHP that is a URI with all arguments:
$string = http://domain.com/php/doc.php?arg1=0&arg2=1&arg3=0
I want to completely remove an argument and return the remain string. For example I want to remove arg3 and end up with:
$string = http://domain.com/php/doc.php?arg1=0&arg2=1
I will always want to remove the same argument (arg3), and it may or not be the last argument.
Thoughts?
EDIT: there might be a bunch of wierd characters in arg3 so my prefered way to do this (in essence) would be:
$newstring = remove $_GET["arg3"] from $string;
There's no real reason to use regexes here, you can use string and array functions instead.
You can explode the part after the ? (which you can get using substr to get a substring and strrpos to get the position of the last ?) into an array, and use unset to remove arg3, and then join to put the string back together.:
$string = "http://domain.com/php/doc.php?arg1=0&arg2=1&arg3=0";
$pos = strrpos($string, "?"); // get the position of the last ? in the string
$query_string_parts = array();
foreach (explode("&", substr($string, $pos + 1)) as $q)
{
list($key, $val) = explode("=", $q);
if ($key != "arg3")
{
// keep track of the parts that don't have arg3 as the key
$query_string_parts[] = "$key=$val";
}
}
// rebuild the string
$result = substr($string, 0, $pos + 1) . join($query_string_parts);
See it in action at http://www.ideone.com/PrO0a
preg_replace("arg3=[^&]*(&|$)", "", $string)
I'm assuming the url itself won't contain arg3= here, which in a sane world should be a safe assumption.
$new = preg_replace('/&arg3=[^&]*/', '', $string);
This should also work, taking into account, for example, page anchors (#) and at least some of those "weird characters" you mention but don't seem worried about:
function remove_query_part($url, $term)
{
$query_str = parse_url($url, PHP_URL_QUERY);
if ($frag = parse_url($url, PHP_URL_FRAGMENT)) {
$frag = '#' . $frag;
}
parse_str($query_str, $query_arr);
unset($query_arr[$term]);
$new = '?' . http_build_query($query_arr) . $frag;
return str_replace(strstr($url, '?'), $new, $url);
}
Demo:
$string[] = 'http://domain.com/php/doc.php?arg1=0&arg2=1&arg3=0';
$string[] = 'http://domain.com/php/doc.php?arg1=0&arg2=1';
$string[] = 'http://domain.com/php/doc.php?arg1=0&arg2=1&arg3=0#frag';
$string[] = 'http://domain.com/php/doc.php?arg1=0&arg2=1&arg3=0&arg4=4';
$string[] = 'http://domain.com/php/doc.php';
$string[] = 'http://domain.com/php/doc.php#frag';
$string[] = 'http://example.com?arg1=question?mark&arg2=equal=sign&arg3=hello';
foreach ($string as $str) {
echo remove_query_part($str, 'arg3') . "\n";
}
Output:
http://domain.com/php/doc.php?arg1=0&arg2=1
http://domain.com/php/doc.php?arg1=0&arg2=1
http://domain.com/php/doc.php?arg1=0&arg2=1#frag
http://domain.com/php/doc.php?arg1=0&arg2=1&arg4=4
http://domain.com/php/doc.php
http://domain.com/php/doc.php#frag
http://example.com?arg1=question%3Fmark&arg2=equal%3Dsign
Tested only as shown.

Categories