php regex preg_match on a variable containing a url

I'm trying to run a regex on a URL to extract all the segments after the host. I can't get it working when the host segment is in a variable, and I'm not sure how to fix it.
// this works
if(preg_match("/^http\:\/\/myhost(\/[a-z0-9A-Z-_\/.]*)$/", $url, $matches)) {
return $matches[2];
}
// this doesn't work
$siteUrl = "http://myhost";
if(preg_match("/^$siteUrl(\/[a-z0-9A-Z-_\/.]*)$/", $url, $matches)) {
return $matches[2];
}
// this doesn't work
$siteUrl = preg_quote("http://myhost");
if(preg_match("/^$siteUrl(\/[a-z0-9A-Z-_\/.]*)$/", $url, $matches)) {
return $matches[2];
}

PHP has a built-in function, parse_url, that does something similar to what you are trying to achieve with your code:
<?php
$url = 'http://username:password@hostname/path?arg=value#anchor';
print_r(parse_url($url));
echo parse_url($url, PHP_URL_PATH);
?>
OUTPUT :
Array
(
[scheme] => http
[host] => hostname
[user] => username
[pass] => password
[path] => /path
[query] => arg=value
[fragment] => anchor
)
/path
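Applied to the question, the host can be compared directly and the path read from the parsed parts, with no regex at all. This is only a minimal sketch: the helper name path_for_host is hypothetical, and it assumes a case-insensitive host comparison is what you want.

```php
<?php
// Hypothetical helper: returns the path of $url when its host equals $expectedHost.
function path_for_host(string $url, string $expectedHost): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // seriously malformed URL, or no host component
    }
    if (strcasecmp($parts['host'], $expectedHost) !== 0) {
        return null; // different host
    }
    // parse_url() omits 'path' for URLs such as "http://myhost"
    return $parts['path'] ?? '/';
}

var_dump(path_for_host('http://myhost/some/page.html', 'myhost'));
```

Note that the second argument here is the bare host ("myhost"), not the full "http://myhost" prefix used in the question.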

You forgot to escape the / characters in your variable, so the first / in "http://" ends the pattern early. One quick fix is to change your regex delimiter from / to #. Try:
$siteUrl = "http://myhost";
if(preg_match("#^$siteUrl(\/[a-z0-9A-Z-_\/.]*)$#", $url, $matches)) { //note the # delimiters!
return $matches[1]; //the pattern has a single capture group, so the path is in $matches[1]
}
Or without changing the regex delimiter:
$siteUrl = "http:\/\/myhost"; //note how we escaped the slashes
if(preg_match("/^$siteUrl(\/[a-z0-9A-Z-_\/.]*)$/", $url, $matches)) {
return $matches[1];
}
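The preg_quote() attempt in the question fails for the same reason: preg_quote() only escapes the delimiter when it is passed as the second argument. A minimal sketch of that variant, assuming a hypothetical helper name path_after_host and the same set of allowed path characters as the question:

```php
<?php
// Hypothetical helper: returns the path after $siteUrl, or null if $url does not match.
function path_after_host(string $url, string $siteUrl): ?string
{
    // The second argument tells preg_quote() to escape the "/" delimiter as well.
    $quoted = preg_quote($siteUrl, '/');
    if (preg_match('/^' . $quoted . '(\/[a-zA-Z0-9_\/.-]*)$/', $url, $matches)) {
        return $matches[1]; // the only capture group
    }
    return null;
}

var_dump(path_after_host('http://myhost/some/page.html', 'http://myhost'));
```

This keeps the / delimiter and still works when $siteUrl contains other regex metacharacters, such as the dots in a real host name.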

Related

Having trouble with preg_match pinterest username url

Please help. The statement I am using to match a Pinterest username URL is
$url = http://pinterest.com/username
preg_match("|^http(s)?://pinterest.com/(.*)?$|i", $url);
but preg_match is returning 0.

You are missing the third parameter of the preg_match function.
$url = "http://pinterest.com/username";
preg_match("|^http(s)?://pinterest.com/(.*)?$|i", $url, $match);
print_r($match);
results in
Array
(
[0] => http://pinterest.com/username
[1] =>
[2] => username
)
Or in an if statement:
$url = "http://pinterest.com/username";
if (preg_match("|^http(s)?://pinterest.com/(.*)?$|i", $url, $match)) {
// true
}

<?php
$url = "http://pinterest.com/username";
if(preg_match("|^http(s)?://pinterest.com/(.*)?$|i", $url)){
echo "true";
}
else{
echo "false";
}
?>
output:
true
What else do you want?

No one has mentioned that the dot needs to be escaped.
So more correct code would be something like this:
$url = "https://pinterest.com/username";
preg_match("|(?:https?://)?(?:www\.)?pinterest\.com/([^/]+)/?|i", $url, $match);
The username will be in $match[1]. I don't know Pinterest's rules for usernames, so I just match everything between the slashes.
It will work with links like:
https://pinterest.com/username/
https://www.pinterest.com/username
pinterest.com/username
and others.
Don't use this regular expression for validation.
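The same extraction can also be done without a single large pattern, by letting parse_url() split the URL first. This is only a sketch: the helper name pinterest_username is hypothetical, and it assumes the username is always the first path segment.

```php
<?php
// Hypothetical helper: returns the first path segment of a pinterest.com URL, or null.
function pinterest_username(string $url): ?string
{
    // parse_url() needs a scheme to recognise the host, so add one when it is missing.
    if (!preg_match('#^https?://#i', $url)) {
        $url = 'https://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($host) || !is_string($path)) {
        return null; // no host or no path at all
    }
    if (!preg_match('/^(www\.)?pinterest\.com$/i', $host)) {
        return null; // not a Pinterest host
    }
    $segments = explode('/', trim($path, '/'));
    return $segments[0] !== '' ? $segments[0] : null;
}

var_dump(pinterest_username('https://www.pinterest.com/username/'));
```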

Extract top domain from string php

I need to extract the domain name out of a string which could be anything. Such as:
$sitelink="http://www.somewebsite.com/product/3749875/info/overview.html";
or
$sitelink="http://subdomain.somewebsite.com/blah/blah/whatever.php";
In any case, I'm looking to extract the 'somewebsite.com' portion (which could be anything), and discard the rest.

Use parse_url($url):
<?php
$url = 'http://username:password@hostname/path?arg=value#anchor';
print_r(parse_url($url));
?>
The above example will output:
Array
(
[scheme] => http
[host] => hostname
[user] => username
[pass] => password
[path] => /path
[query] => arg=value
[fragment] => anchor
)
Using those values:
echo parse_url($url, PHP_URL_HOST); //hostname
or
$url_info = parse_url($url);
echo $url_info['host'];//hostname

here it is
<?php
$sitelink="http://www.somewebsite.com/product/3749875/info/overview.html";
$domain_pieces = explode(".", parse_url($sitelink, PHP_URL_HOST));
$l = sizeof($domain_pieces);
$secondleveldomain = $domain_pieces[$l-2] . "." . $domain_pieces[$l-1];
echo $secondleveldomain;
Note that this is probably not the behavior you are looking for because, for hosts like
stackoverflow.co.uk
it will echo "co.uk".
See:
http://publicsuffix.org/learn/
http://www.dkim-reputation.org/regdom-libs/
http://www.dkim-reputation.org/regdom-lib-downloads/ <-- downloads here, php included
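To illustrate what a suffix-aware lookup does differently, here is a rough sketch. The helper name registrable_domain is hypothetical, and the $suffixes default is a tiny hand-written sample, not the real Public Suffix List; a real implementation would load the full list from one of the libraries above.

```php
<?php
// Hypothetical helper. $suffixes is a tiny illustrative sample, NOT the real Public Suffix List.
function registrable_domain(string $url, array $suffixes = ['co.uk', 'com', 'org', 'uk']): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // no host found in the URL
    }
    $labels = explode('.', strtolower($host));
    // Try the longest candidate suffix first, so "co.uk" wins over "uk".
    for ($i = 1; $i < count($labels); $i++) {
        $suffix = implode('.', array_slice($labels, $i));
        if (in_array($suffix, $suffixes, true)) {
            // The registrable domain is the suffix plus one label to its left.
            return $labels[$i - 1] . '.' . $suffix;
        }
    }
    return null; // suffix not in the sample list
}

var_dump(registrable_domain('http://www.stackoverflow.co.uk/questions'));
```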

Two more complex URLs:
$url="https://www.example.co.uk/page/section/younameit";
or
$url="https://example.co.uk/page/section/younameit";
To get the full host ("www.example.co.uk" for the first URL):
$host=parse_url($url, PHP_URL_HOST);
To get "example.co.uk" only:
// strip a leading "www." if present; ltrim($host, 'www.') would strip any leading "w" and "." characters, not the prefix
$domain = preg_replace('/^www\./', '', $host);
Whether or not your URL includes "www.", you get the same end result, i.e. "example.co.uk".
Voilà!

You need a package that uses the Public Suffix List; only this way can you correctly extract domains with two- and third-level suffixes (co.uk, a.bg, b.bg, etc.) and multilevel subdomains. Regex, parse_url() or string functions alone will never produce a fully correct result.
I recommend TLD Extract. Here is an example:
$extract = new LayerShifter\TLDExtract\Extract();
$result = $extract->parse('http://www.somewebsite.com/product/3749875/info/overview.html');
$result->getSubdomain(); // will return (string) 'www'
$result->getHostname(); // will return (string) 'somewebsite'
$result->getSuffix(); // will return (string) 'com'
$result->getRegistrableDomain(); // will return (string) 'somewebsite.com'

For a string that could be anything, new approach:
function extract_plain_domain($text) {
$text=trim($text,"/");
$text=strtolower($text);
$parts=explode("/",$text);
if (substr_count($parts[0],"http")) {
$parts[0]="";
}
foreach ($parts as $val) { // each() was removed in PHP 8, so iterate with foreach
if (!empty($val)) { $text=$val; break; }
}
$parts=explode(".",$text);
if (empty($parts[2])) {
return $parts[0].".".$parts[1];
} else {
$num_parts=count($parts);
return $parts[$num_parts-2].".".$parts[$num_parts-1];
}
} // end function extract_plain_domain

You can use the Utopia Domains library (https://github.com/utopia-php/domains). It returns the domain's TLD and public suffix based on the Mozilla Public Suffix List (https://publicsuffix.org) and can be used as an alternative to the now-archived TLDExtract package.
You can use the parse_url function to get the hostname from your URL and then use the Utopia Domains parser to get the correct suffix and join it with the domain name:
<?php
require_once './vendor/autoload.php';
use Utopia\Domains\Domain;
$url = 'http://demo.example.co.uk/site';
$domain = new Domain(parse_url($url, PHP_URL_HOST)); // demo.example.co.uk
var_dump($domain->get()); // demo.example.co.uk
var_dump($domain->getTLD()); // uk
var_dump($domain->getSuffix()); // co.uk
var_dump($domain->getName()); // example
var_dump($domain->getSub()); // demo
var_dump($domain->isKnown()); // true
var_dump($domain->isICANN()); // true
var_dump($domain->isPrivate()); // false
var_dump($domain->isTest()); // false
var_dump($domain->getName().'.'.$domain->getSuffix()); // example.co.uk

check string for youtube link or image

I have a small piece of code that checks a string for a URL and adds the <a href> tag to create a link. I also have it check the string for a YouTube link and then add rel="youtube" to the <a> tag.
How can I get the code to only add rel to the youtube links?
How can I get it to add a different rel to any type of image link?
$text = "http://site.com a site www.anothersite.com/ http://www.youtube.com/watch?v=UyxqmghxS6M here is another site";
$linkstring = preg_replace( '/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', '\4', $text );
if(preg_match('/http:\/\/www\.youtube\.com\/watch\?v=[^&]+/', $linkstring, $vresult)) {
$linkstring = preg_replace( '/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', '<a rel="youtube" href="\0">\4</a>', $text );
$type= 'youtube';
}
else {
$type = 'none';
}
echo $text;
echo $linkstring, "<br />";
echo $type, "<br />";

Try http://simplehtmldom.sourceforge.net/.
Code:
<?php
include('simple_html_dom.php');
$html = str_get_html('Link');
$html->find('a', 0)->rel = 'youtube';
echo $html;
Output:
[username@localhost dom]$ php dom.php
Link
You can build an entire page DOM or a simple single link with this library.
Detecting hostname of URL:
Pass the url to parse_url. parse_url returns an array of the URL parts.
Code:
print_r(parse_url('http://www.youtube.com/watch?v=UyxqmghxS6M'));
Output:
Array
(
[scheme] => http
[host] => www.youtube.com
[path] => /watch
[query] => v=UyxqmghxS6M
)

Try the following:
//text
$text = "http://site.com/bounty.png a site www.anothersite.com/ http://www.youtube.com/watch?v=UyxqmghxS6M&featured=true here is another site";
//Youtube links
$pattern = "/(http:\/\/){0,1}(www\.){0,1}youtube\.com\/watch\?v=([a-z0-9\-_\|]{11})[^\s]*/i";
$replacement = '<a rel="youtube" href="http://www.youtube.com/watch?v=\3">\0</a>';
$text = preg_replace($pattern, $replacement, $text);
//image links
$pattern = "/(http:\/\/){0,1}(www\.){0,1}[^\/]+\/[^\s]+\.(png|jpg|jpeg|bmp|gif)[^\s]*/i";
$replacement = '<a rel="image" href="\0">\0</a>';
$text = preg_replace($pattern, $replacement, $text);
Note that the latter can only detect links to images that have a file extension. As such, links like www.example.com?image=3 will not be detected.
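An alternative is a single pass with preg_replace_callback(), deciding the rel value per URL inside the callback. This is a sketch under assumptions: the helper name link_with_rel is hypothetical, only http(s) URLs with an explicit scheme are linked (so www.anothersite.com/ is left alone), and images are detected by extension only.

```php
<?php
// Hypothetical helper: wraps each http(s) URL in $text in an <a> tag and picks rel by URL type.
function link_with_rel(string $text): string
{
    $result = preg_replace_callback(
        '#https?://\S+#i',
        function (array $m): string {
            $url  = $m[0];
            $host = (string) parse_url($url, PHP_URL_HOST);
            $path = (string) parse_url($url, PHP_URL_PATH);

            $rel = '';
            if (preg_match('/(^|\.)youtube\.com$/i', $host)) {
                $rel = ' rel="youtube"';
            } elseif (preg_match('/\.(png|jpe?g|bmp|gif)$/i', $path)) {
                $rel = ' rel="image"';
            }

            // Escape the URL before putting it into HTML.
            $safe = htmlspecialchars($url, ENT_QUOTES);
            return '<a' . $rel . ' href="' . $safe . '">' . $safe . '</a>';
        },
        $text
    );
    return $result ?? $text; // preg_replace_callback() returns null on error
}

echo link_with_rel('http://site.com/bounty.png and http://www.youtube.com/watch?v=UyxqmghxS6M');
```

Because each URL is matched exactly once, a YouTube link cannot be rewritten a second time by the image rule, which is a risk when two preg_replace() calls run one after the other.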

php regex to get string inside href tag

I need a regex that will give me the string inside an href tag and inside the quotes also.
For example, I need to extract theurltoget.com from the following:
<a href="theurltoget.com">URL</a>
Additionally, I only want the base url part. I.e. from http://www.mydomain.com/page.html i only want http://www.mydomain.com/

Don't use regex for this. You can use XPath and built-in PHP functions to get what you want:
$xml = simplexml_load_string($myHtml);
$list = $xml->xpath("//@href");
$preparedUrls = array();
foreach($list as $item) {
$item = parse_url((string) $item); // cast the SimpleXMLElement attribute to a string
$preparedUrls[] = $item['scheme'] . '://' . $item['host'] . '/';
}
print_r($preparedUrls);

$html = '<a href="http://www.mydomain.com/page.html">URL</a>';
$url = preg_match('/<a href="(.+)">/', $html, $match);
$info = parse_url($match[1]);
echo $info['scheme'].'://'.$info['host']; // http://www.mydomain.com

this expression will handle 3 options:
no quotes
double quotes
single quotes
'/href=["\']?([^"\'>]+)["\']?/'

Use the answer by @Alec if you're only looking for the base URL part (the second part of the question by @David)!
$html = '<a href="http://www.mydomain.com/page.html" class="myclass" rel="myrel">URL</a>';
$url = preg_match('/<a href="(.+)">/', $html, $match);
$info = parse_url($match[1]);
This will give you:
$info
Array
(
[scheme] => http
[host] => www.mydomain.com
[path] => /page.html" class="myclass" rel="myrel
)
So you can use $href = $info["scheme"] . "://" . $info["host"]
Which gives you:
// http://www.mydomain.com
When you are looking for the entire URL inside the href, you should use another regex, for instance the one provided by @user2520237.
$html = '<a href="http://www.mydomain.com/page.html" class="myclass" rel="myrel">URL</a>';
$url = preg_match('/href=["\']?([^"\'>]+)["\']?/', $html, $match);
$info = parse_url($match[1]);
this will give you:
$info
Array
(
[scheme] => http
[host] => www.mydomain.com
[path] => /page.html
)
Now you can use $href = $info["scheme"] . "://" . $info["host"] . $info["path"];
Which gives you:
// http://www.mydomain.com/page.html

http://www.the-art-of-web.com/php/parse-links/
Let's start with the simplest case - a well formatted link with no extra attributes:
/<a href=\"([^\"]*)\">(.*)<\/a>/iU

For all href values replacement:
function replaceHref($html, $replaceStr)
{
$match = array();
preg_match_all('/<a [^>]*href="([^"]+)"/', $html, $match); // [^"]+ stops at the closing quote
if(count($match[1]))
{
for($j=0; $j<count($match[1]); $j++) // iterate over the captured URLs, not the number of groups
{
$html = str_replace($match[1][$j], $replaceStr.urlencode($match[1][$j]), $html);
}
}
return $html;
}
$replaceStr = "http://affilate.domain.com?cam=1&url=";
$replaceHtml = replaceHref($html, $replaceStr);
echo $replaceHtml;

This will handle the case where there are no quotes around the URL.
/<a [^>]*href="?([^">]+)"?>/
But seriously, do not parse HTML with regex. Use DOM or a proper parsing library.
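For reference, a DOM-based version of the original task (collecting the base URL of every link) could look roughly like this. It is a sketch, not a definitive implementation: the helper name base_urls_from_html is hypothetical, it relies on PHP's bundled DOM extension, and it assumes $html is a non-empty string.

```php
<?php
// Hypothetical helper: returns the "scheme://host/" part of every absolute href in $html.
function base_urls_from_html(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for imperfect HTML.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bases = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $parts = parse_url($anchor->getAttribute('href'));
        // Skip relative links and anything parse_url() could not split.
        if (is_array($parts) && isset($parts['scheme'], $parts['host'])) {
            $bases[] = $parts['scheme'] . '://' . $parts['host'] . '/';
        }
    }
    return $bases;
}

print_r(base_urls_from_html('<a href="http://www.mydomain.com/page.html" class="myclass">URL</a>'));
```

Unlike the regex variants, this does not care about attribute order or whether the quotes are single, double, or missing.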

#href="(https?://[^/]*)#
(# is used as the delimiter here so that the slashes in the pattern need no escaping.) I think you should be able to handle the rest.

Because lookbehind and lookahead are cool
/(?<=href=\").+(?=\")/
It will match only what you want, without quotation marks
Array (
[0] => theurltoget.com )

How to check if a given value is a valid URL

I need a function to check whether a given value is a URL.
I have code:
<?php
$string = get_from_db();
list($name, $url) = explode(": ", $string);
if (is_url($url)) {
$link = array('name' => $name, 'link' => $url);
} else {
$text = $string;
}
// Make some things
?>

If you're running PHP 5 (and you should be!), just use filter_var():
function is_url($url)
{
return filter_var($url, FILTER_VALIDATE_URL) !== false;
}
Addendum: as the PHP manual entry for parse_url() (and @Liutas in his comment) points out:
This function is not meant to validate the given URL, it only breaks it up into the above listed parts. Partial URLs are also accepted, parse_url() tries its best to parse them correctly.
For example, parse_url() considers a query string as part of a URL. However, a query string is not entirely a URL. The following line of code:
var_dump(parse_url('foo=bar&baz=what'));
Outputs this:
array(1) {
["path"]=>
string(16) "foo=bar&baz=what"
}
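FILTER_VALIDATE_URL accepts any syntactically valid URL regardless of scheme, so if the value is going to be used as a web link it may be worth restricting the scheme as well. A minimal sketch, assuming a hypothetical helper name is_http_url and that only http and https links are wanted:

```php
<?php
// Hypothetical helper: true only for syntactically valid http:// or https:// URLs.
function is_http_url(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false; // not a valid URL at all
    }
    // parse_url() is used only to read the scheme of an already validated URL.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

var_dump(is_http_url('http://example.com/path?x=1'));
var_dump(is_http_url('foo=bar&baz=what'));
```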

Use parse_url and check for false:
<?php
$url = 'http://username:password@hostname/path?arg=value#anchor';
print_r(parse_url($url));
echo parse_url($url, PHP_URL_PATH);
?>
The above example will output:
Array
(
[scheme] => http
[host] => hostname
[user] => username
[pass] => password
[path] => /path
[query] => arg=value
[fragment] => anchor
)
/path

You can check whether parse_url() returns false.
