preg_match pick URL from other site

I want to pick all the directory URLs from this site.
I used preg_match, but it retrieves every URL on the site, including links I don't need.
Here is my code.
How do I get only the submission links from that site?

I tried running this and it seems to work; I only changed the regex:
<?php
for ($i = 0; $i <= 25; $i++) {
    $site_url = "http://www.directorymaximizer.com/index.php?pageNum_directory_list=$i";
    $html = file_get_contents($site_url);
    if ($html === false) {
        continue; // skip pages that could not be fetched
    }
    // The wanted URLs sit between "-->" and "<!--" in the page source
    $regex = '#-->(https?://[^<]*)<!--#';
    preg_match_all($regex, $html, $matches, PREG_PATTERN_ORDER);
    // $matches[1] holds the first capture group, i.e. the bare URLs
    foreach (array_unique($matches[1]) as $url) {
        echo $url;
        echo "<br />\n";
    }
}

You'll want an HTML parser for that. HTML is irregular, so regular expressions don't work well on it.
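
A minimal sketch of the parser approach, using PHP's bundled DOM extension. The helper name extract_links() and the sample markup are assumptions for illustration only; the real page may be structured differently.

```php
<?php
// Hypothetical helper: collect the href of every <a> element in an HTML string.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Keep only absolute http(s) links.
        if (preg_match('#^https?://#i', $href)) {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}

// Made-up sample markup, not taken from the site in the question.
$sample = '<p><a href="http://example.com/submit">Submit</a> <a href="/about">About</a></p>';
$links = extract_links($sample);
```

The relative link is dropped by the scheme check, so only the absolute URL ends up in $links.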

To use a regular expression for this you need some consistent delimiters. Thankfully, the URLs you want (and only those you want) seem to look like this in the source:
target="_blank">-->the url is here<!--</a>-->
Meaning the regular expression you'd want is:
#target="_blank">-->(?P<url>.+?)<!--</a>-->#
Where matches from the first capture group, indexed under "url", will contain the - surprise - URLs. Why the named capture group? Just seems easier to figure out what it is you're doing when you look back at your code.
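
As a rough illustration of how the named group is read back, here is a small sketch. The sample markup is made up to mimic the delimiters described above and is an assumption, not a copy of the real page.

```php
<?php
// Made-up sample that mimics the delimiters described above.
$html = '<!--<a href="#" target="_blank">-->http://example.com/submit.php<!--</a>-->'
      . '<a href="http://example.com/other">other</a>';

$regex = '#target="_blank">-->(?P<url>.+?)<!--</a>-->#';
preg_match_all($regex, $html, $matches);

// Named groups are available under their name as well as their numeric index.
$urls = $matches['url'];
```

Here $matches['url'] and $matches[1] hold the same list, so the name is purely a readability aid.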

I have a nifty little tool for you to make regular expression keys with.
Go check out RegExr at gskinner.com.
Additionally, I believe this is the pattern you're looking for. For an anchor to match, it must have a full URL including the domain. The URL, domain, and path end up in an array. See below.
preg_match('/(?P<url>http:\/\/(?P<domain>[a-z0-9\/]+\.[\w]+)(?P<path>[\/\?\w\.=\&]+)?)[\s\w="]+>/', $site, $anchors);
$url = $anchors['url'];
$domain = $anchors['domain'];
$path = $anchors['path'];
Let me know how it goes. I did not test this, so I apologize if there is an error.

Related

Regular Expression Validation PHP

I've been trying to get this to work for some time now but can't. Here is my problem:
I have the following reg. expression: (http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?. I'm trying to validate a URL.
The problem is when I have for example:
"https://www.youtube.com/watch?v=QK8mJJJvaes<br />Hello" (this is how it saves in the database using nl2br)
It validates up to this: https://www.youtube.com/watch?v=QK8mJJJvaes<br. I've read that the problem might be the \S* in the regular expression, but if I take that out it only validates https://www.youtube.com/.
I've also thought of adding a space before the <br />, but I don't know if there is a better solution.
Any help is greatly appreciated :).
Full Code:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
// The Text you want to filter for urls
$finalMsg = 'https://www.youtube.com/watch?v=QK8mJJJvaes<br />Hello';
// Check if there is a url in the text
if(preg_match_all($reg_exUrl, $finalMsg, $url)){
// make the urls hyper links
$matches = array_unique($url[0]);
foreach($matches as $match) {
$replacement = "<a href=".$match." target='_blank'>{$match}</a>";
$finalMsg = str_replace($match,$replacement,$finalMsg);
}
}

Change it to this:
/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S[^<]*)?/
This will at least validate your given URL, and any other that ends with a tag...
Test it here: https://regex101.com/
EDIT: That pattern doesn't match root paths. The solution from @Jonathan Kuhn in the comments is the best one:
/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^\s<]*)?/
UPDATE:
Just revisiting some old answers and I'm irritated why I commented like I did.. I don't see the problem though, your code works. :D
Although this short piece of code would do the same:
$url = "https://www.youtube.com/watch?v=QK8mJJJvaes<br />Hello";
$regex = '/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^\s<]*)?/';
// make the URLs hyperlinks
$url = preg_replace($regex, '<a href="$0" target="_blank">$0</a>', $url);
echo $url;
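
To make the difference between the two patterns concrete, this small sketch runs both against the string from the question. The variable names are only for illustration.

```php
<?php
$text = 'https://www.youtube.com/watch?v=QK8mJJJvaes<br />Hello';

// Original pattern: \S* keeps going until the first whitespace, so it swallows "<br".
$greedy = '/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/';
// Adjusted pattern: [^\s<]* also stops at the start of a tag.
$fixed  = '/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^\s<]*)?/';

preg_match($greedy, $text, $m1);
preg_match($fixed, $text, $m2);
// $m1[0] is 'https://www.youtube.com/watch?v=QK8mJJJvaes<br'
// $m2[0] is 'https://www.youtube.com/watch?v=QK8mJJJvaes'
```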

Regex/Php: Cannot match question mark in live site

I would like to remove the $_GET parameter of the first "page" item on a website.
The following works perfectly in a test script on my local server:
$urls = array(
'http://www.foo.com/bar.html?p=1', //should match
'http://www.foo.com/bar.html?p=23',
'http://www.foo.com/bar.html?p=120',
'http://www.foo.com/bar.html?baz=123&p=1' //should match
);
foreach ($urls as $url) {
echo $url . '<br>';
echo preg_replace('/([\?&]p=1)(?!\d)/', '', $url) . '<p>';
}
This produces:
http://www.foo.com/bar.html?p=1
http://www.foo.com/bar.html
http://www.foo.com/bar.html?p=23
http://www.foo.com/bar.html?p=23
http://www.foo.com/bar.html?p=120
http://www.foo.com/bar.html?p=120
http://www.foo.com/bar.html?baz=123&p=1
http://www.foo.com/bar.html?baz=123
However on the live site, it never matches.
To make matters worse,
str_replace('?p=1','',$url);
will not work as well. What am I missing? I can match a single question mark, but as soon as something follows it, I'm out of luck. This is the case for both str_replace and preg_replace. I feel like I'm missing something obvious, but I cannot figure it out. Thank you for your help.
Solution:
In my specific case, it turned out that the underlying Magento shop system was already giving out html_encoded characters. This, plus the fact the first parameter is always a session ID which is later removed from the URL string, made my task as easy as
$url = str_replace('&amp;p=1', '', $url);
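
If the URL arrives HTML-encoded, another option is to decode it first and then apply the regex from the question. This is only a sketch: the SID parameter and the sample URLs are made up, and it assumes the encoding is the only difference between the local and live strings.

```php
<?php
// Hypothetical URL as it might look after HTML-encoding; the SID parameter is made up.
$encoded = 'http://www.foo.com/bar.html?SID=abc&amp;p=1';

// Turn "&amp;" back into "&" so the pattern sees the plain query string.
$decoded = html_entity_decode($encoded, ENT_QUOTES);

// Remove "p=1" only when it is not followed by another digit.
$clean = preg_replace('/[?&]p=1(?!\d)/', '', $decoded);

// A page number such as 12 is left alone.
$untouched = preg_replace('/[?&]p=1(?!\d)/', '', 'http://www.foo.com/bar.html?SID=abc&p=12');
```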

Try \\\? instead of \?. If that doesn't work, you might be running a regex engine version which doesn't support negative lookahead.
In that case you could rewrite your preg_replace as
preg_replace('/([\?&]p=1)(\D|$)/', '$2', $url) . '<p>';
which consumes the following non-digit (or matches the end of the string) and puts it back again. There might be edge cases where this differs from your regex, but I don't think you'd encounter them with URLs (and I can't think of any off the top of my head).
of course, there are other non-regex solutions to this, but as regex is a very powerful tool, it's always good to learn something about it ;)

regex to get current page or directory name?

I am trying to get the page or last directory name from a url
For example, if the URL is http://www.example.com/dir/, I want it to return dir; if the passed URL is http://www.example.com/page.php, I want it to return page. Note that I do not want the trailing slash or the file extension.
I tried this:
$regex = "/.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*/i";
$name = strtolower(preg_replace($regex,"$2",$url));
I ran this regex in PHP and it returned nothing. (however I tested the same regex in ActionScript and it worked!)
So what am I doing wrong here, how do I get what I want?
Thanks!!!

Don't use / as the regex delimiter if it also contains slashes. Try this:
$regex = "#^.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*$#i";
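
A quick check of the pattern with # as the delimiter, run against the two URLs from the question (the variable names are just for illustration):

```php
<?php
// With "#" as the delimiter, the "/" inside the pattern needs no escaping.
$regex = '#^.*\.(com|gov|org|net|mil|edu)/([a-z_\-]+).*$#i';

$page = strtolower(preg_replace($regex, '$2', 'http://www.example.com/page.php'));
$dir  = strtolower(preg_replace($regex, '$2', 'http://www.example.com/dir/'));
// $page is 'page', $dir is 'dir'
```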

You may try to escape the "/" in the middle; unescaped, it simply closes your regex. So this may work:
$regex = "/.*\.(com|gov|org|net|mil|edu)\/([a-z_\-]+).*/i";
You may also make the regex somewhat more general, but that's another problem.

You can use this
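$parts = explode('/', rtrim($url, '/')); // drop a trailing slash first
$last = array_pop($parts); // array_pop() needs a variable, not a function result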
Then apply a simple regex to remove any file extension

Assuming you want to match the entire address after the domain portion:
$regex = "%://[^/]+/([^?#]+)%i";
The above assumes a URL of the format extension://domainpart/everythingelse.

Then again, it seems that the problem here isn't that your RegEx isn't powerful enough, just mistyped (closing delimiter in the middle of the string). I'll leave this up for posterity, but I strongly recommend you check out PHP's parse_url() method.
This should adequately deliver:
substr($s = basename($_SERVER['REQUEST_URI']), 0, strrpos($s,'.') ?: strlen($s))
But this is better:
preg_replace('/[#\.\?].*/','',basename($path));
Although, your example is short, so I cannot tell if you want to preserve the entire path or just the last element of it. The preceding example will only preserve the last piece, but this should save the whole path while being generic enough to work with just about anything that can be thrown at you:
preg_replace('~(?:/$|[#\.\?].*)~','',substr(parse_url($path, PHP_URL_PATH),1));

As much as I personally love using regular expressions, more 'crude' (for want of a better word) string functions might be a good alternative for you. The snippet below uses sscanf to parse the path part of the URL for the first bunch of letters.
$url = "http://www.example.com/page.php";
$path = parse_url($url, PHP_URL_PATH);
sscanf($path, '/%[a-z]', $part);
// $part = "page";

This expression:
(?<=^[^:]+://[^.]+(?:\.[^.]+)*/)[^/]*(?=\.[^.]+$|/$)
Gives the following results:
http://www.example.com/dir/ dir
http://www.example.com/foo/dir/ dir
http://www.example.com/page.php page
http://www.example.com/foo/page.php page
Apologies in advance if this is not valid PHP regex - I tested it using RegexBuddy.

Save yourself the regular expression and make PHP's other functions feel more loved.
$url = "http://www.example.com/page.php";
$filename = pathinfo(parse_url($url, PHP_URL_PATH), PATHINFO_FILENAME);
Warning: for PHP 5.2 and up.
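
The same two calls also cover the directory case from the question, because pathinfo() ignores a trailing slash. A small sketch follows; the helper name last_segment() is made up for illustration.

```php
<?php
// Hypothetical helper: last path segment of a URL, without its file extension.
function last_segment(string $url): string
{
    // parse_url() returns null when the URL has no path, so cast to string.
    $path = (string) parse_url($url, PHP_URL_PATH);
    return pathinfo($path, PATHINFO_FILENAME);
}

$a = last_segment('http://www.example.com/dir/');
$b = last_segment('http://www.example.com/page.php');
$c = last_segment('http://www.example.com/foo/page.php');
```

Note that only the last extension is removed, so a name like archive.tar.gz would come back as archive.tar.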

Strip All Urls From A Mixed String ( php )

I reposted this question because I didn't find a good answer.
I have a string which can contain text with URLs.
I want a function to strip all URLs from this string and leave just the text.
For example, the string can look like this:
1) hey take a look here : http://xxx.xxx/545df5 this is nice!
2) hey take a look here : http://www.xxx.xxx/545df5 this is nice!
3) hey take a look here : xxx.xxx/545df5 this is nice!
4) hey take a look here : www.xxx.xxx/545df5 this is nice!
Thanks

Regular expression for URL and how to use regular expression with php should help you.

What you really need is a solid regex to find urls in a string and you can preg_replace that pattern with nothing. I can tell you though that tracking down a regex like that is not easy. Depending on the variations in the urls you're looking for (i.e. http:// vs https:// vs ftp://) You could run into real trouble trying to account for all that.
Here is a page that I found to be a good start though.

Regex is the way to go as was discussed prior. Finding one isn't that terribly hard (google: url regex pattern) One example returned is here
http://www.geekzilla.co.uk/View2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm
I would also recommend you test your regex using one of the many fine online regex testers. My favorite (for non-java) is
http://www.regextester.com/

This function should do it (assuming the words in your string are separated by a space " "):
function isValidURL($url) {
return preg_match('|^http(s)?://[a-z0-9-]+(\.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $url);
}
function cleanUpUrls($urls) {
$urlArray = explode(' ',$urls);
$resultArray = array();
foreach ($urlArray as $url) {
if(!isValidURL($url)) {
$resultArray[] = $url;
}
}
return implode(' ',$resultArray);
}
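
The function above only recognises URLs that start with http:// or https://. A rough sketch that also covers the protocol-less forms from the question could look like this; the helper name strip_urls() and the pattern are assumptions, and bare domains without a slash (for example xxx.xxx on its own) are deliberately left alone to avoid eating ordinary words.

```php
<?php
// Hypothetical helper: remove URLs with or without a protocol, then tidy up spaces.
function strip_urls(string $text): string
{
    // Either "http(s)://..." / "www...." or "domain.tld/..." (the slash is required).
    $pattern = '~\b(?:(?:https?://|www\.)\S+|[a-z0-9-]+(?:\.[a-z0-9-]+)+/\S*)~i';
    $stripped = preg_replace($pattern, '', $text);
    // Collapse the gap left behind by the removed URL.
    return trim(preg_replace('/\s+/', ' ', $stripped));
}

$results = array_map('strip_urls', [
    'hey take a look here : http://xxx.xxx/545df5 this is nice!',
    'hey take a look here : http://www.xxx.xxx/545df5 this is nice!',
    'hey take a look here : xxx.xxx/545df5 this is nice!',
    'hey take a look here : www.xxx.xxx/545df5 this is nice!',
]);
```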

Extract URLs from text in PHP

I have this text:
$string = "this is my friend's website http://example.com I think it is coll";
How can I extract the link into another variable?
I know it should be by using regular expression especially preg_match() but I don't know how?

Probably the safest way is using code snippets from WordPress. Download the latest one (currently 3.1.1) and see wp-includes/formatting.php. There's a function named make_clickable which has plain text for param and returns formatted string. You can grab codes for extracting URLs. It's pretty complex though.
This one line regex might be helpful.
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $string, $match);
But this regex still can't remove some malformed URLs (ex. http://google:ha.ckers.org ).
See also:
How to mimic StackOverflow Auto-Link Behavior
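
A short sketch of what the one-line regex does with trailing punctuation. The sample sentence is made up for illustration.

```php
<?php
// Made-up sample text with punctuation directly after each URL.
$string = "See http://example.com. Also https://example.org/a/b, and that's all";

preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $string, $match);

// $match[0] holds the full matches; the pattern refuses to end on punctuation,
// so the trailing "." and "," are not part of the URLs.
$urls = $match[0];
```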

I tried to do as Nobu said, using WordPress, but it had too many dependencies on other WordPress functions. I instead opted to take Nobu's regular expression for preg_match_all() and turn it into a function using preg_replace_callback(), which now replaces all URLs in a text with clickable links. It uses an anonymous function, so you'll need PHP 5.3 or later; otherwise rewrite the code to use an ordinary function.
<?php
/**
* Make clickable links from URLs in text.
*/
function make_clickable($text) {
$regex = '#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#';
return preg_replace_callback($regex, function ($matches) {
return "<a href='{$matches[0]}'>{$matches[0]}</a>";
}, $text);
}

URLs have a quite complex definition — you must decide what you want to capture first. A simple example capturing anything starting with http:// and https:// could be:
preg_match_all('!https?://\S+!', $string, $matches);
$all_urls = $matches[0];
Note that this is very basic and could capture invalid URLs. I would recommend catching up on POSIX and PHP regular expressions for more complex things.

The code that worked for me (especially if you have several links in your $string):
$string = "this is my friend's website https://www.example.com I think it is cool, but this one is cooler https://www.stackoverflow.com :)";
$regex = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';
preg_match_all($regex, $string, $matches);
$urls = $matches[0];
// go over all links
foreach($urls as $url)
{
echo $url.'<br />';
}
Hope that helps others as well.

If the text you extract the URLs from is user-submitted and you're going to display the result as links anywhere, you have to be very, VERY careful to avoid XSS vulnerabilities, most prominently "javascript:" protocol URLs, but also malformed URLs that might trick your regexp and/or the displaying browser into executing them as Javascript URLs. At the very least, you should accept only URLs that start with "http", "https" or "ftp".
There's also a blog entry by Jeff where he describes some other problems with extracting URLs.
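
One way to apply that advice is to check the scheme against an allow-list and escape the URL before printing it. This is a minimal sketch, not a complete XSS defence; the helper name safe_link() is made up for illustration.

```php
<?php
// Hypothetical helper: build a link only for allow-listed schemes, escaping the URL.
function safe_link(string $url): ?string
{
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if (!in_array($scheme, ['http', 'https', 'ftp'], true)) {
        return null; // e.g. javascript: or data: URLs are rejected
    }
    // Escape quotes, ampersands and angle brackets before the URL reaches the HTML.
    $escaped = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    return "<a href=\"{$escaped}\">{$escaped}</a>";
}

$ok  = safe_link('http://example.com/?a=1&b=2');
$bad = safe_link('javascript:alert(1)');
```

Returning null for rejected input leaves it to the caller to decide whether to print the original text escaped or drop it.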

preg_match_all('/[a-z]+:\/\/\S+/', $string, $matches);
This is an easy way that works for a lot of cases, but not all. All the matches are put in $matches. Note that this does not cover links in anchor elements (<a href=""...), but that wasn't in your example either.

You could do like this..
<?php
$string = "this is my friend's website http://example.com I think it is coll";
echo explode(' ',strstr($string,'http://'))[0]; //"prints" http://example.com

preg_match_all ("/a[\s]+[^>]*?href[\s]?=[\s\"\']+".
"(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
$var, $matches); // no "&" here: call-time pass-by-reference is an error in PHP 5.4+
$matches = $matches[1];
$list = array();
foreach($matches as $var)
{
print($var."<br>");
}

You could try this to find the link and revise the link (add the href link).
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
// The Text you want to filter for urls
$text = "The text you want to filter goes here. http://example.com";
if(preg_match($reg_exUrl, $text, $url)) {
echo preg_replace($reg_exUrl, '<a href="$0">$0</a> ', $text); // $0 is the matched URL
} else {
echo "No url in the text";
}
refer here: http://php.net/manual/en/function.preg-match.php

There are a lot of edge cases with URLs. For example, a URL could contain brackets, or come without a protocol. That's why a regex alone is not enough.
I created a PHP library that could deal with lots of edge cases: Url highlight.
Example:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
$urlHighlight->getUrls("this is my friend's website http://example.com I think it is coll");
// return: ['http://example.com']
For more details see readme. For covered url cases see test.

Here is a function I use, can't remember where it came from but seems to do a pretty good job of finding links in the text. and making them links.
You can change the function to suit your needs. I just wanted to share this as I was looking around and remembered I had this in one of my helper libraries.
function make_links($str){
$pattern = '(?xi)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function($matches) {
$input = $matches[0];
$url = preg_match('!^https?://!i', $input) ? $input : "http://$input";
return '<a href="' . $url . '">' . "$input</a>";
}, $str);
}
Use:
$subject = 'this is a link http://google:ha.ckers.org maybe don\'t want to visit it?';
echo make_links($subject);
Output
this is a link http://google:ha.ckers.org maybe don't want to visit it?

<?php
preg_match_all('/(href|src)[\s]?=[\s\"\']?+(.*?)[\s\"\']+.*?/', $webpage_content, $link_extracted);

This regex works great for me, and I have checked it with all types of URLs:
<?php
$string = "Thisregexfindurlhttp://www.rubular.com/r/bFHobduQ3n mixedwithstring";
preg_match_all('/(https?|ssh|ftp):\/\/[^\s"]+/', $string, $url);
$all_url = $url[0]; // Returns Array Of all Found URL's
$one_url = $url[0][0]; // Gives the First URL in Array of URL's
?>
It was checked against lots of URLs, which you can find here: http://www.rubular.com/r/bFHobduQ3n

public function find_links($post_content){
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
// Check if there is a url in the text
if(preg_match_all($reg_exUrl, $post_content, $urls)) {
// make the urls hyper links,
foreach($urls[0] as $url){
$post_content = str_replace($url, ' LINK ', $post_content);
}
//var_dump($post_content);die(); //uncomment to see result
//return text with hyper links
return $post_content;
} else {
// if no urls in the text just return the text
return $post_content;
}
}
