Strip All Urls From A Mixed String ( php )

I reposted this question because I didn't find a good answer.
I have a string which can contain text with URLs.
I want a function to strip all URLs from this string and leave just the text.
For example, the string can look like this:
1) hey take a look here : http://xxx.xxx/545df5 this is nice!
2) hey take a look here : http://www.xxx.xxx/545df5 this is nice!
3) hey take a look here : xxx.xxx/545df5 this is nice!
4) hey take a look here : www.xxx.xxx/545df5 this is nice!
Thanks

Regular expression for URL and how to use regular expression with php should help you.

What you really need is a solid regex to find URLs in a string; you can then preg_replace that pattern with nothing. I can tell you, though, that tracking down a regex like that is not easy. Depending on the variations in the URLs you're looking for (e.g. http:// vs https:// vs ftp://), you could run into real trouble trying to account for all of them.
Here is a page that I found to be a good start though.

Regex is the way to go, as was discussed previously. Finding one isn't terribly hard (google: url regex pattern). One example returned is here:
http://www.geekzilla.co.uk/View2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm
I would also recommend you test your regex using one of the many fine online regex testers. My favorite (for non-java) is
http://www.regextester.com/

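This function should do it (assuming the words in your string are separated by a space " "):
function isValidURL($url) {
    // The dot between host labels must be escaped; an unescaped "." matches any character.
    return preg_match('|^http(s)?://[a-z0-9-]+(\.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $url) === 1;
}
function cleanUpUrls($urls) {
    // Split the text into space-separated tokens.
    $urlArray = explode(' ', $urls);
    $resultArray = array();
    foreach ($urlArray as $url) {
        // Keep only the tokens that are not http(s) URLs.
        if (!isValidURL($url)) {
            $resultArray[] = $url;
        }
    }
    return implode(' ', $resultArray);
}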
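Note that the function above only drops tokens that start with http:// or https://, so examples 3 and 4 from the question (xxx.xxx/545df5 and www.xxx.xxx/545df5) would stay in the text. Here is a minimal sketch that covers all four examples with a single preg_replace. The name stripUrls and the pattern are my own assumptions, not a complete URL grammar: anything that looks like a dotted host name counts as a URL, so tokens such as "file.txt" would be removed as well.

```php
<?php
// Hypothetical helper: remove URL-like tokens from free text.
function stripUrls(string $text): string
{
    // Heuristic pattern (an assumption, not a full URL grammar):
    //   - optional http:// or https:// scheme
    //   - optional "www."
    //   - a dotted host name ending in a label of 2+ letters
    //   - optional path made of non-whitespace characters
    $pattern = '~\b(?:https?://)?(?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}(?:/\S*)?~i';

    $stripped = preg_replace($pattern, '', $text);

    // Collapse the runs of whitespace left behind by the removed URLs.
    return trim(preg_replace('/\s{2,}/', ' ', $stripped));
}

echo stripUrls('hey take a look here : www.xxx.xxx/545df5 this is nice!'), "\n";
```

If false positives on things like file names are a problem, requiring either a scheme, a "www." prefix, or a slash after the host would tighten the match.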

Related

Remove end of URL with variable PHP Regex

I suuuuuck at regex and can't even begin to figure out how to remove everything from /edit to the end, which contains a variable part of the URL, from this kind of URL:
https://docs.google.com/presentation/d/1aa_xpsyJtslFJsg4UndsjDvlCe7Vu97_i6Q8zSKofy4/edit?usp=sharing
Any help would be greatly appreciated!
Thanks!

Using strstr() with the third parameter set to true will be the cleanest, most direct non-regex approach. ...and you won't have to sweat your "suuuuucky" regex skills ;) This will isolate the substring from start of the string to the character before your search substring.
Code: (Demo)
$url = 'https://docs.google.com/presentation/d/1aa_xpsyJtslFJsg4UndsjDvlCe7Vu97_i6Q8zSKofy4/edit?usp=sharing';
echo strstr($url, '/edit', true); // https://docs.google.com/presentation/d/1aa_xpsyJtslFJsg4UndsjDvlCe7Vu97_i6Q8zSKofy4
echo "\n";
echo strstr($url, '/edit?', true); // https://docs.google.com/presentation/d/1aa_xpsyJtslFJsg4UndsjDvlCe7Vu97_i6Q8zSKofy4
*note: If the querystring (beginning with ?) will always exist after /edit, adding the ? to the search substring can only improve accuracy.
Why is this the best function to call? It doesn't incur the overhead of calling the regex engine, it doesn't generate any temporary arrays, and it is a single function call as opposed to substr()-strrpos().
If your use cases are a bit more complex and this approach is letting you down, calling parse_url() should stabilize things sufficiently to allow you to extract the appropriate URL components.
Code: (Demo)
$url = 'https://docs.google.com/presentation/d/1aa_xpsyJtslFJsg4UndsjDvlCe7Vu97_i6Q8zSKofy4/edit?usp=sharing';
$components = parse_url($url);
echo $components['scheme'], '://', $components['host'], strstr($components['path'],'/edit',true);

I believe you are trying to parse the query parameters at the end of the url. You can do so by using the explode function:
$url = "https://docs.google.com/presentation/d/1aa_xpsyJtslFJsg4UndsjDvlCe7Vu97_i6Q8zSKofy4/edit?usp=sharing";
print(explode('/edit', $url)[1]);
which will print
?usp=sharing

Extract URL from string

I'm trying to find a reliable solution to extract a url from a string of characters. I have a site where users answer questions and in the source box, where they enter their source of information, I allow them to enter a url. I want to extract that url and make it a hyperlink. Similar to how Yahoo Answers does it.
Does anyone know a reliable solution that can do this?
All the solutions I have found work for some URLs but not for others.
Thanks

John Gruber has spent a fair amount of time perfecting the "one regex to rule them all" for link detection. Combined with preg_replace() as mentioned in the other answers, the following regex should be one of the most accurate, if not the most accurate, methods for detecting a link:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
If you only wanted to match HTTP/HTTPS:
(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
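To use a pattern like this from PHP you have to get it into a string without mangling the quotes and backslashes. A minimal sketch, assuming the HTTP/HTTPS-only variant above: the function name extractUrls, the "~" delimiter and the "u" modifier are my choices, not part of the original regex.

```php
<?php
// Hypothetical wrapper around the HTTP/HTTPS-only pattern quoted above.
function extractUrls(string $text): array
{
    // A nowdoc keeps the backslashes and both kinds of quotes literal.
    // "~" is used as the delimiter because it does not occur in the pattern;
    // the "u" modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = <<<'REGEX'
~(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))~u
REGEX;

    preg_match_all($pattern, $text, $matches);

    return $matches[0]; // index 0 holds the full matches
}

print_r(extractUrls('See http://example.com/a. Thanks'));
```

The trailing period is left out of the match because the last character of a link may not be sentence punctuation.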

$string = preg_replace('/https?:\/\/[^\s"<>]+/', '<a href="$0">$0</a>', $string);
It only matches http/https, but those are really the only protocols you want to turn into a link. If you want others, you can change it like this:
$string = preg_replace('/(https?|ssh|ftp):\/\/[^\s"]+/', '<a href="$0">$0</a>', $string);

There are a lot of edge cases with URLs. For example, a URL could contain brackets or lack a protocol. That's why a regex alone is not enough.
I created a PHP library that deals with lots of edge cases: Url highlight.
You can extract URLs from a string or directly highlight them.
Example:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
// Extract urls
$urlHighlight->getUrls("This is example http://example.com.");
// return: ['http://example.com']
// Make urls as hyperlinks
$urlHighlight->highlightUrls('Hello, http://example.com.');
// return: 'Hello, <a href="http://example.com">http://example.com</a>.'
For more details see readme. For covered url cases see test.

Yahoo! Answers does a fairly good job of link identification when the link is written properly and separate from other text, but it isn't very good at separating trailing punctuation. For example, "The links are http://example.com/somepage.php, http://example.com/somepage2.php, and http://example.com/somepage3.php." will include commas on the first two and a period on the third.
But if that is acceptable, then patterns like this should do it:
\<http:[^ ]+\>
It looks like Stack Overflow's parser is better. Is it open source?
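If the trailing punctuation is not acceptable, one simple workaround is to match greedily and trim afterwards. This is only a sketch: the name findUrls and the set of trimmed characters are assumptions, and a URL that legitimately ends in one of those characters would be cut short.

```php
<?php
// Hypothetical helper: find http(s) URLs and drop trailing sentence punctuation.
function findUrls(string $text): array
{
    // Grab every run of non-space characters that starts with http:// or https://.
    preg_match_all('~https?://\S+~i', $text, $matches);

    // Strip punctuation that most likely belongs to the sentence, not the URL.
    return array_map(
        function ($url) {
            return rtrim($url, '.,;:!?');
        },
        $matches[0]
    );
}

print_r(findUrls('The links are http://example.com/somepage.php, http://example.com/somepage2.php, and http://example.com/somepage3.php.'));
```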

This code worked for me.
function makeLink($string){
    /*** make sure there is an http:// on all URLs ***/
    $string = preg_replace("/([^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i", "$1http://$2", $string);
    /*** make all URLs links (the hyphen after \w is escaped so PCRE2, used by PHP 7.3+, accepts the character class) ***/
    $string = preg_replace("/([\w]+:\/\/[\w\-?&;#~=\.\/\@]+[\w\/])/i", "<a target=\"_blank\" href=\"$1\">$1</a>", $string);
    /*** make all emails hot links ***/
    $string = preg_replace("/([\w\-?&;#~=\.\/]+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i", "<a href=\"mailto:$1\">$1</a>", $string);
    return $string;
}

preg_match to array. PHP

Calling all the PHP helpers out there.
So basically I would like to give the function preg_match a variable (that can contain a couple thousand lines of code) and have it search using a wildcard with strings on either side of the wildcard.
For example, I would like to search for strings that look like this: <a href="*.pdf">
I would then like the function to return every match (along with the html shiz around the wildcard, this is to catch any directory structures too) in an array that I can loop through using a foreach(){} loop.
I'm guessing this is possible; would anyone have the time to help me with this?
I've checked through all the preg_match lit' and through the answers on here, but I can't seem to get the patterns correct. Thanks in advance.
Peace out.

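preg_match_all('/<a href="[^"]+\.pdf">/', $text, $matches);
// $matches[0] holds every full match; looping over $matches itself
// would iterate over the capture groups instead of the matches.
foreach ($matches[0] as $shiz)
{
    // Your code here ...
}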
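If you need the path without the surrounding tag, a capture group around the wildcard part gives you both at once. A small sketch along the same lines; the function name findPdfLinks and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper: return the full <a> tags and the captured .pdf paths.
function findPdfLinks(string $html): array
{
    // The parentheses capture only the href value.
    preg_match_all('/<a href="([^"]+\.pdf)">/', $html, $matches);

    return [
        'tags'  => $matches[0], // full matches, e.g. <a href="docs/manual.pdf">
        'paths' => $matches[1], // first capture group, e.g. docs/manual.pdf
    ];
}

$text = '<p><a href="docs/manual.pdf">Manual</a> and <a href="/files/report.pdf">Report</a></p>';

foreach (findPdfLinks($text)['paths'] as $path) {
    echo $path, "\n";
}
```

Like the pattern above, this only matches tags where href is the first and only attribute.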

PHP Regex help, getting part of a link

I'm trying to write a regex in PHP that, given a line like
<a href="mypage.php?(some junk)&p=12345&(other junk)" other link stuff>Text</a>
will return only "p=12345", or even "12345". Note that the (some junk)& and the &(other junk) may or may not be present.
Can I do this with one expression, or will I need more than one? I can't seem to work out how to do it in one, which is what I would like if at all possible. I'm also open to other methods of doing this, if you have a suggestion.
Thanks

Perhaps a better tactic than using a regular expression in this case is to use parse_url.
You can use that to get the query (what comes after the ? in your URL), then split on the '&' character and then on '=' to put things into a nice dictionary.

Use parse_url and parse_str:
$url = 'mypage.php?(some junk)&p=12345&(other junk)';
$parsed_url = parse_url($url);
parse_str($parsed_url['query'], $parsed_str);
echo $parsed_str['p'];
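Since the question starts from a whole <a> tag rather than a bare URL, here is a rough sketch that first pulls out the href and then applies parse_url and parse_str. The helper name getLinkParam is hypothetical, and I am assuming a double-quoted href attribute:

```php
<?php
// Hypothetical helper: read one query parameter from the first href in $html.
function getLinkParam(string $html, string $name): ?string
{
    if (!preg_match('/href="([^"]*)"/i', $html, $m)) {
        return null; // no double-quoted href attribute found
    }

    // HTML attributes usually encode "&" as "&amp;", so decode first.
    $href = html_entity_decode($m[1]);

    $query = parse_url($href, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string, or the URL could not be parsed
    }

    parse_str($query, $params);

    return isset($params[$name]) && is_string($params[$name]) ? $params[$name] : null;
}

echo getLinkParam('<a href="mypage.php?a=1&amp;p=12345&amp;z=9" class="x">Text</a>', 'p'), "\n";
```

This works whether or not other parameters come before or after p, which was the tricky part with a single regex.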

preg_match pick URL from other site

I want to pick all directory URLs from this site.
I used preg_match, but it retrieves every URL on the site, including unnecessary links.
Here is my code.
How do I get all the submission links from that site?

I tried running this and it seems to work; I only changed the regex.
<?php
for ($i = 0; $i <= 25; $i++) {
    $site_url = "http://www.directorymaximizer.com/index.php?pageNum_directory_list=$i";
    $html = file_get_contents($site_url);
    if ($html === false) {
        continue; // skip pages that could not be fetched
    }
    // The wanted URLs sit between "-->" and "<!--" in the page source.
    $regex = '#-->(https?://[^<]*)<!--#';
    preg_match_all($regex, $html, $matches, PREG_PATTERN_ORDER);
    // $matches[1] holds the first capture group, i.e. the bare URLs.
    foreach (array_unique($matches[1]) as $url) {
        if ($url != "" && !is_numeric($url)) {
            echo $url;
            echo "<br />\n";
        }
    }
}

You'll want an HTML parser for that. HTML is not a regular language, so regular expressions don't work well on it.
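For instance, PHP's bundled DOM extension can do the parsing. A minimal sketch, assuming ext-dom is available (it usually is, but some distributions package it separately); the function name extractHrefs is mine:

```php
<?php
// Hypothetical helper: collect every href from the anchors in an HTML string.
function extractHrefs(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid, so collect parser warnings
    // internally instead of emitting them, then discard them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }

    return $hrefs;
}

print_r(extractHrefs('<p><a href="http://example.com/one">One</a> <a href="/two">Two</a></p>'));
```

You would still have to filter the result down to the links you care about, but you no longer depend on the exact attribute order or quoting in the page source.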

To use a regular expression for this you need some consistent delimiters. Thankfully, the URLs you want - and only those you want - seem to look like this in the source:
target="_blank">-->the url is here<!--</a>-->
Meaning the regular expression you'd want is:
#target="_blank">-->(?P<url>.+?)<!--</a>-->#
Where matches from the first capture group, indexed under "url", will contain the - surprise - URLs. Why the named capture group? Just seems easier to figure out what it is you're doing when you look back at your code.
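In PHP that could look roughly like the following. The function name extractListedUrls and the sample markup are invented for illustration; I have not run this against the real page.

```php
<?php
// Hypothetical helper built around the pattern above.
function extractListedUrls(string $html): array
{
    preg_match_all('#target="_blank">-->(?P<url>.+?)<!--</a>-->#', $html, $matches);

    // Named groups appear under their name as well as their numeric index.
    return $matches['url'];
}

// Made-up sample in the shape described above.
$sample = '<a href="#" target="_blank">-->http://example.com/<!--</a>-->'
        . '<a href="#" target="_blank">-->http://example.org/dir<!--</a>-->';

print_r(extractListedUrls($sample));
```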

I have a nifty little tool for you to make regular expression keys with.
Go check out RegExr at gskinner.com.
Additionally, I believe this is the pattern you're looking for. For an anchor to be matched, it must have a full URL including the domain. It will output the URL, domain, and path in an array. See below.
preg_match('#http://(?P<url>(?P<domain>[a-z0-9.-]+\.\w+)(?P<path>[/?\w.=&]+)?)[\s\w="]+>#', $site, $anchors);
$url = $anchors['url'];
$domain = $anchors['domain'];
$path = $anchors['path'];
Let me know how it goes. I did not test this, so I apologize if there is an error.
