Say if I have two strings
$first = 'http://www.example.com';
$second = 'www.example.com/';
How could I determine they match? I just care that the example part matches. I'm thinking some form of Regex pattern would match but I can't figure it out at all.
Don't use a regex if you're trying to evaluate structured data. Regexes are not a magic wand you wave at every problem that happens to involve strings. What if you have a URL like http://www.some-other-domain.com/blah/blah/?www.example.com?
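If you're trying to match a domain name to a domain name, break the URL apart to get the host and compare that. In PHP, use the parse_url function. For http://www.example.com it gives you www.example.com as the host name, which you can compare against the hostname you expect. Note that parse_url only reports a host when the string has a scheme (or starts with //), so a bare www.example.com/ is parsed as a path; prepend a scheme to such strings first.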
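A minimal sketch of that comparison. The helper name host_of() is hypothetical (it is not part of PHP or of the original posts), and it assumes that inputs without a scheme should be treated as http:

```php
<?php
// host_of() is a hypothetical helper name used for illustration.
function host_of(string $url): ?string
{
    // parse_url() only reports a host when a scheme is present,
    // so add one to inputs such as "www.example.com/".
    if (!preg_match('#^[a-z][a-z0-9+.\-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    // parse_url() yields null (no host) or false (malformed URL) on failure.
    return is_string($host) ? strtolower($host) : null;
}

$first  = 'http://www.example.com';
$second = 'www.example.com/';
echo host_of($first) === host_of($second) ? 'Match' : 'Not Match'; // prints "Match"
```

Two unparseable inputs would both yield null and compare equal, so real code would probably want to check for null before comparing.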
Try this
function DomainUrl($x) {
$url = $x;
if ( substr($url, 0, 7) == 'http://') { $url = substr($url, 7); }
if ( substr($url, 0, 8) == 'https://') { $url = substr($url, 8); }
if ( substr($url, 0, 4) == 'www.') { $url = substr($url, 4); }
if ( substr($url, 0, 5) == 'www9.') { $url = substr($url, 5); }
if ( strpos($url, '/') !== false) {
$ex = explode('/', $url);
$url = $ex[0];
}
return $url;
}
$first = DomainUrl('http://www.example.com');
$second = DomainUrl('www.example.com/');
if($first == $second){
echo 'Match';
}else{
echo 'Not Match';
}
I have this URL: "http://example.com/search/dep_city:LON/arr_city=NYC".
How can I get the values "LON" and "NYC" from this URL in PHP?
This would be easy if the URL were in the format "dep_city=LON", using "$_GET['dep_city']", but in this case I am confused about how to do this.
Thanks in advance for your time and help.
Get the URL from the server, explode it into parts.
Check the index of : or = for each part, parse it into an array.
<?php
// Drop the leading "/" from the request path.
$url = substr($_SERVER["REQUEST_URI"], 1);
$vars = explode("/", $url);
$params = array();
foreach($vars as $var){
// Accept either "=" or ":" as the key/value separator.
$index = strpos($var, "=");
if($index === false){
$index = strpos($var, ":");
}
if($index === false){
continue; // plain path segment such as "search", nothing to parse
}
$params[substr($var, 0, $index)] = substr($var, $index + 1);
}
echo json_encode($params);
die();
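The same idea can be tried outside a web server by hard-coding the question's URL. This is only a sketch: it swaps the strpos/substr pair for preg_split, and it assumes the first ":" or "=" in a segment is the separator.

```php
<?php
// The question's URL is hard-coded here instead of reading $_SERVER["REQUEST_URI"].
$url  = 'http://example.com/search/dep_city:LON/arr_city=NYC';
$path = parse_url($url, PHP_URL_PATH); // "/search/dep_city:LON/arr_city=NYC"

$params = array();
foreach (explode('/', trim($path, '/')) as $segment) {
    // Split on the first ":" or "=" only; segments without one (e.g. "search") are skipped.
    $pair = preg_split('/[:=]/', $segment, 2);
    if (count($pair) === 2) {
        $params[$pair[0]] = $pair[1];
    }
}
print_r($params); // dep_city => LON, arr_city => NYC
```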
I followed a tutorial on making a web crawler app. It simply pulls all the links from a page and then follows them. I have a problem pushing the links from the foreach loop into the global variable. I keep getting an error saying that the second argument to in_array should be an array, which is what I set it to. Can you see anything that might be breaking the code?
Error:
in_array() expects parameter 2 to be array, null given
PHP:
<?php
$to_crawl = "http://thechive.com/";
$c = array();
function get_links($url){
global $c;
$input = file_get_contents($url);
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
$base_url = parse_url($url, PHP_URL_HOST);
foreach($l as $link){
if(strpos($link, '#')){
$link = substr($link, 0, strpos($link, '#'));
}
if(substr($link, 0, 1) == "."){
$link = substr($link, 1);
}
if(substr($link, 0, 7) == "http://"){
$link = $link;
} elseif(substr($link, 0, 8) == "https://"){
$link = $link;
} elseif(substr($link, 0, 2) == "//"){
$link = substr($link, 2);
} elseif(substr($link, 0, 1) == "#"){
$link = $url;
} elseif(substr($link, 0, 7) == "mailto:"){
$link = "[".$link."]";
} else{
if(substr($link, 0,1) != "/"){
$link = $base_url."/".$link;
} else{
$link = $base_url.$link;
}
}
if(substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && substr($link, 0, 1) != "["){
if(substr($link, 0 , 8) == "https://"){
$link = "https://".$link;
} else{
$link= "http://".$link;
}
}
if (!in_array($link, $c)){
array_push($c, $link);
}
}
}
get_links($to_crawl);
foreach($c as $page){
get_links($page);
}
foreach($c as $page){
echo $page."<br />";
}
?>
Pulling $c into the function with "global" on every call is bad design. You should avoid "global" whenever possible.
Here I see two choices:
1/ Pass your array by reference (search for "PHP pass by reference") as a parameter of the "get_links" function. That lets the function fill the array.
Example:
function getlinks($url, &$links){
//do your stuff to find the links
//then add each link to the array
$links[] = $oneLink;
}
$allLinks = array();
getlinks("thefirsturl.com", $allLinks);
//call getlinks as many times as you want
//then your array will contain all the links
print_r($allLinks);
Or 2/ Make "get_links" return an array of links, and concatenate it into a bigger one to store all your links.
function getlinks($url){
$links = array();
//do your stuff to find the links
//then add each link to the array
$links[] = $oneLink;
return $links;
}
$allLinks = array();
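$allLinks = array_merge($allLinks, getlinks("thefirsturl.com"));
//call getlinks as many times as you want. Note array_merge: += on arrays is a union by key, so it would skip links whose numeric keys already exist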
print_r($allLinks);
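A runnable sketch of the second option. The get_links() body below is a hypothetical stand-in that returns fixed data instead of fetching pages, and the a.example / b.example URLs are made up for illustration. It also shows why array_merge() is used for the accumulation: the + operator on arrays keeps the left-hand value for any key that already exists, so it would drop links from later pages.

```php
<?php
// Hypothetical stand-in for the real fetch-and-parse step: returns canned links.
function get_links(string $url): array
{
    $fake = array(
        'http://a.example/' => array('http://a.example/1', 'http://b.example/'),
        'http://b.example/' => array('http://a.example/1', 'http://b.example/2'),
    );
    return $fake[$url] ?? array();
}

$allLinks = array();
foreach (array('http://a.example/', 'http://b.example/') as $page) {
    // array_merge() appends and renumbers; "+" would ignore keys 0 and 1 the second time.
    $allLinks = array_merge($allLinks, get_links($page));
}
// Remove duplicates once at the end, then reindex from 0.
$allLinks = array_values(array_unique($allLinks));
print_r($allLinks);
```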
I want to extract the website name from a link, so I wrote the following function:
protected function getWebsiteName()
{
$prefixs = ['https://', 'http://', 'www.'];
foreach($prefixs as $prefix)
{
if(strpos($this->website_link, $prefix) !== false)
{
$len = strlen($prefix);
$this->website_name = substr($this->website_link, $len);
$this->website_name = substr($this->website_name, 0, strpos($this->website_name, '.'));
}
}
}
The problem is that when I use a website link that looks like https://www.github.com, the result is s://www, and the function only works when I remove 'www.' from the array.
Any ideas why this is happening, or how I can improve this function?
You could use parse_url(). Try:
print_r(parse_url('https://www.name/'));
Let's look at your code. On every pass through the foreach, you apply your logic to the original website_link, not to the result of the previous pass. So when the loop reaches www. (after the first two iterations), this happens:
$prefix is www.
Therefore, $len = 4 (the length of $prefix)
$this->website_link is still https://www.github.com
You apply substr($this->website_link, 4)
Result is $this->website_name = 's://www.github.com'
You apply substr($this->website_name, 0, 7) (7 being the result of strpos($this->website_name, '.'))
The result is $this->website_name = 's://www'
To fix this, copy $this->website_link into $temp, strip each prefix from $temp, and only cut at the first dot once the loop has finished:
$temp = $this->website_link;
foreach($prefixs as $prefix)
{
if(strpos($temp, $prefix) === 0) // only strip a prefix found at the very start
{
$len = strlen($prefix);
$temp = substr($temp, $len);
}
}
$this->website_name = substr($temp, 0, strpos($temp, '.'));
I'd suggest @dynamic's answer, but if you want to stick with string replacement, use str_replace. It accepts an array of needles! Note that it removes the prefixes wherever they occur in the string, not only at the start.
$prefixes = ['https://', 'http://', 'www.'];
$this->website_name = str_replace($prefixes, '', $this->website_link);
$this->website_name = substr($this->website_name, 0, strpos($this->website_name, '.'));
Yes, using parse_url along with preg_match should do the job:
function getWebsiteName($url)
{
$pieces = parse_url($url);
$domain = isset($pieces['host']) ? $pieces['host'] : '';
if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
return $regs['domain'];
}
return false;
}
This fixes your code by stripping a prefix only when it sits at the start of the string:
function getWebsiteName()
{
$this->website_name = $this->website_link;
$prefixs = array('https://', 'http://', 'www.');
foreach($prefixs as $prefix)
{
if (substr($this->website_name, 0, strlen($prefix)) == $prefix) {
$this->website_name = substr($this->website_name, strlen($prefix));
}
}
}
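Putting the suggestions from these answers together, here is a minimal sketch. The function name website_name() is hypothetical, and it assumes the link always carries a scheme and that the "name" is simply the first label of the host once a leading www. is removed (so it would return "example" for example.co.uk but "blog" for blog.example.com).

```php
<?php
// website_name() is a hypothetical helper, not taken from the original posts.
function website_name(string $link): ?string
{
    $host = parse_url($link, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // no scheme in the link, or a malformed URL
    }
    // Strip "www." only when it is at the start of the host.
    $host = preg_replace('/^www\./i', '', $host);
    $dot  = strpos($host, '.');
    return $dot === false ? $host : substr($host, 0, $dot);
}

var_dump(website_name('https://www.github.com')); // string(6) "github"
```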
I am using file_get_contents($url), which does not work with URLs that start with www.
So I am trying to prepend http:// to put the URL into a proper form when the user enters it without a scheme.
<?php
$url= 'www.google.com';
$pad = 'http://';
$cmp = 'www';
$prefix = substr($url , 0,2);
if($cmp == $prefix)
{
echo str_pad($url, strlen($url)+3 ,"$pad",STR_PAD_LEFT);
}
?>
This code does not echo the correct URL. What is wrong here?
Why not use parse_url to figure it out?
$url = "www.example.com/test.php";
$parsedUrl = parse_url($url);
if(!array_key_exists('scheme', $parsedUrl)){
$url = "http://" . $url;
}
echo $url;
This is all you need:
if (strpos($url, '://') === false)
$url = 'http://' . $url;
Check this:
$url= 'www.google.com';
$pad = 'http://';
$cmp = 'www';
$prefix = substr($url , 0,3);
if($cmp == $prefix)
{
echo str_pad($url, strlen($url)+7 ,"$pad",STR_PAD_LEFT);
}
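One edge case worth keeping in mind with these approaches: a "://" can also appear later in the string, for example inside a query parameter. A hedged sketch that anchors the check at the start of the string; ensure_scheme() is a hypothetical helper name, and defaulting to http is an assumption:

```php
<?php
// ensure_scheme() is a hypothetical helper name used for illustration.
function ensure_scheme(string $url, string $default = 'http'): string
{
    // Anchor the check at the start so a "://" later in the string
    // (for example inside a query parameter) does not count as a scheme.
    if (preg_match('#^[a-z][a-z0-9+.\-]*://#i', $url)) {
        return $url;
    }
    return $default . '://' . $url;
}

echo ensure_scheme('www.google.com'), PHP_EOL;         // http://www.google.com
echo ensure_scheme('https://www.google.com'), PHP_EOL; // https://www.google.com
echo ensure_scheme('www.example.com/?next=http://a.example/'), PHP_EOL;
```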
When I spider a website, I get a lot of bad URLs like these:
http://example.com/../../.././././1.htm
http://example.com/test/../test/.././././1.htm
http://example.com/.//1.htm
http://example.com/../test/..//1.htm
All of these should be http://example.com/1.htm.
How can I do this in PHP? Thanks.
PS: I use http://snoopy.sourceforge.net/
I get a lot of repeated links in my database; for example, 'http://example.com/../test/..//1.htm' should be 'http://example.com/1.htm'.
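You could do it like this, assuming all the URLs you have provided are expected to be http://example.com/1.htm. Note that basename() keeps only the last path component, so this discards every directory and only suits files at the site root: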
$test = array('http://example.com/../../../././.\./1.htm',
'http://example.com/test/../test/../././.\./1.htm',
'http://example.com/.//1.htm',
'http://example.com/../test/..//1.htm');
foreach ($test as $url){
$u = parse_url($url);
$path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
echo $path.'<br />'.PHP_EOL;
}
/* result
http://example.com/1.htm<br />
http://example.com/1.htm<br />
http://example.com/1.htm<br />
http://example.com/1.htm<br />
*/
//or as a function @lpc2138
function getRealUrl($url){
$u = parse_url($url);
$path = $u['scheme'].'://'.$u['host'].'/'.basename($u['path']);
$path .= (!empty($u['query'])) ? '?'.$u['query'] : '';
return $path;
}
You seem to be looking for an algorithm to remove the dot segments (RFC 3986, section 5.2.4, describes one):
function remove_dot_segments($abspath) {
$ib = $abspath;
$ob = '';
while ($ib !== '') {
if (substr($ib, 0, 3) === '../') {
$ib = substr($ib, 3);
} else if (substr($ib, 0, 2) === './') {
$ib = substr($ib, 2);
// check the length first so $ib[2] and $ib[3] are never read past the end of the string
} else if (substr($ib, 0, 2) === '/.' && (strlen($ib) === 2 || $ib[2] === '/')) {
$ib = '/'.substr($ib, 3);
} else if (substr($ib, 0, 3) === '/..' && (strlen($ib) === 3 || $ib[3] === '/')) {
$ib = '/'.substr($ib, 4);
$ob = substr($ob, 0, strlen($ob)-strlen(strrchr($ob, '/')));
} else if ($ib === '.' || $ib === '..') {
$ib = '';
} else {
$pos = strpos($ib, '/', 1);
if ($pos === false) {
$ob .= $ib;
$ib = '';
} else {
$ob .= substr($ib, 0, $pos);
$ib = substr($ib, $pos);
}
}
}
return $ob;
}
This removes the . and .. segments. Removing anything else, such as an empty segment (//) or .\., is not covered by the standard because it changes the semantics of the path.
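If collapsing empty segments is acceptable for deduplicating crawled links, a shorter segment-stack variant is possible. This is a sketch, not the standard algorithm: normalize_url() is a hypothetical helper, it deliberately drops empty segments (which the standard does not do), and it assumes absolute URLs without a port, user info, or fragment, and does not preserve a trailing slash.

```php
<?php
// normalize_url() is a hypothetical helper; it assumes an absolute http(s) URL.
function normalize_url(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // leave anything we cannot parse untouched
    }
    $stack = array();
    foreach (explode('/', $parts['path'] ?? '/') as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;          // skip "." and empty segments (the latter is non-standard)
        }
        if ($segment === '..') {
            array_pop($stack); // go up one level; a no-op when already at the root
            continue;
        }
        $stack[] = $segment;
    }
    $result = $parts['scheme'] . '://' . $parts['host'] . '/' . implode('/', $stack);
    if (isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }
    return $result;
}

// Each of the question's URLs comes out as http://example.com/1.htm
foreach (array(
    'http://example.com/../../.././././1.htm',
    'http://example.com/test/../test/.././././1.htm',
    'http://example.com/.//1.htm',
    'http://example.com/../test/..//1.htm',
) as $u) {
    echo normalize_url($u), PHP_EOL;
}
```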
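You could do some fancy regex, but this works for simple cases. Note that it deletes the ../ and ./ sequences instead of resolving them, so /test/../1.htm becomes /test/1.htm, and empty segments (//) are left in place.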
fixUrl('http://example.com/../../../././.\./1.htm');
function fixUrl($str) {
$str = str_replace('../', '', $str);
$str = str_replace('./', '', $str);
$str = str_replace('\.', '', $str);
return $str;
}