I'm working with some code used to try and find all the website URLs within a block of text. Right now we've already got checks that work fine for URLs formatted such as http://www.google.com or www.google.com but we're trying to find a regex that can locate a URL in a format such as just google.com
Right now our regex is set to search for every domain that we could find registered which is around 1400 in total, so it looks like this:
/(\S+\.(COM|NET|ORG|CA|EDU|UK|AU|FR|PR)\S+)/i
Except with ALL 1400 domains in the group (the full pattern is around 8400 characters long). Naturally it runs quite slowly. We've already considered checking only the 10 or so most commonly used domains, but I wanted to ask here first whether there is a more efficient way to detect this URL format than listing every single domain.
You could use a double pass search.
Search for every url-like string, e.g.:
((http|https):\/\/)?([\w-]+\.)+[\S]{2,5}
On every result do some non-regex checks, like, is the length enough, is the text after the last dot part of your tld list, etc.
function isUrl($urlMatch) {
$tldList = ['com', 'net'];
$urlParts = explode(".", $urlMatch);
$lastPart = end($urlParts);
return in_array($lastPart, $tldList);
}
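To make the two passes concrete, here is a minimal sketch of how they could fit together. The function name findBareUrls, the sample text, and the three-entry TLD list are assumptions for illustration; the first-pass pattern is a slightly tightened variant of the one above, not a definitive URL matcher.

```php
<?php
// Hypothetical helper: the name and the short TLD list are illustrative assumptions.
function findBareUrls(string $text, array $tldList): array
{
    // Pass 1: a cheap, permissive pattern that grabs anything shaped like host.tld
    preg_match_all('~(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}~i', $text, $m);

    // Pass 2: keep only candidates whose last label is a known TLD.
    $lookup = array_flip($tldList); // hash lookups instead of scanning the list each time
    $found = [];
    foreach ($m[0] as $candidate) {
        $labels = explode('.', strtolower($candidate));
        $tld = end($labels);
        if (isset($lookup[$tld])) {
            $found[] = $candidate;
        }
    }
    return $found;
}

$urls = findBareUrls('Visit google.com or www.example.org, not file.txt today.', ['com', 'net', 'org']);
print_r($urls);
```

With the full list of around 1400 TLDs, the array_flip lookup keeps the second pass cheap, and the regex itself no longer grows with the list.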
Example
function get_host($url) {
$host = parse_url($url, PHP_URL_HOST);
$names = explode(".", $host);
if(count($names) == 1) {
return $names[0];
}
$names = array_reverse($names);
return $names[1] . '.' . $names[0];
}
Usage
echo get_host('https://google.com'); // google.com
echo "\n";
echo get_host('https://www.google.com'); // google.com
echo "\n";
echo get_host('https://sub1.sub2.google.com'); // google.com
echo "\n";
echo get_host('http://localhost'); // localhost
I want to check whether my link appears on a website, and also whether it is visible. I wrote this code:
$content = file_get_contents('tmp/test.html');
$pattern = '/<a\shref="http:\/\/mywebsite.com(.*)">(.*)<\/a>/siU';
$matches = [];
if(preg_match($pattern, $content, $matches)) {
$link = $matches[0];
$displayPattern = '/display(.?):(.?)none/si';
if(preg_match($displayPattern, $link)) {
echo 'not visible';
} else {
echo 'visible';
}
} else {
echo 'not found the link';
}
It works, but not perfectly. If the link looks like this:
<a class="sg" href="http://mywebsite.com">mywebsite.com</a>
the first pattern won't work, but if I change the \s to (.*) it returns the string starting from the first a tag. The second problem is that there are two patterns. Is there any way to merge the first with the negation of the second? The merged pattern would have 2 results: visible or not found/invisible.
I'll try to guess.
You run into the problem if your code (the content you fetch with file_get_contents) looks like this
<a class="sg" href="http://mywebsite.com">mywebsite.com</a>
.
.
.
mywebsite.com
Your regex will return everything from the first <a tag onward, because the dot also matches newlines (that is what the 's' flag does; remove it if you don't need it).
Therefore
.*
will keep consuming as much as it can, so you need to make it lazy (non-greedy)
(when it's lazy it stops as soon as the rest of the pattern can match), like this
.*?
Your regex should then look like this (without the 'U' flag, which inverts the meaning of the ?)
<a.*?href="http:\/\/mywebsite.com(.*?)">(.*?)<\/a>
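As a rough sketch of the whole check, the snippet below finds the link lazily and then looks for display:none inside the matched tag. It swaps the leading .*? for [^>]*, which cannot run past the end of the opening tag; that substitution, the sample HTML, and the variable names are my assumptions rather than part of the answer above.

```php
<?php
// Sample markup (made up for illustration).
$html = '<a class="sg" href="http://mywebsite.com">mywebsite.com</a>' . "\n"
      . '<p>filler</p>' . "\n"
      . '<a href="http://other.example">other</a>';

// [^>]* stays inside the opening tag; (.*?) stops at the first closing </a>.
$pattern = '~<a\b[^>]*?href="http://mywebsite\.com([^"]*)"[^>]*>(.*?)</a>~si';

if (preg_match($pattern, $html, $m)) {
    // $m[0] is the whole anchor element; check it for an inline display:none.
    $hidden = (bool) preg_match('~display\s*:\s*none~i', $m[0]);
    $status = $hidden ? 'not visible' : 'visible';
} else {
    $status = 'not found the link';
}
echo $status, "\n";
```

This only inspects inline styles on the link itself; a link hidden through a stylesheet or a parent element would still be reported as visible.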
I need some PHP help with strings.
I have a textbox field where users will enter a facebook profile link.
Example: http://facebook.com/zuck
Now the problem is I need to have EXACTLY this string: "http://graph.facebook.com/zuck".
Inputs could be anything like:
http://facebook.com/zuck
http://www.facebook.com/zuck
www.facebook.com/zuck
facebook.com/zuck
What's the best way to do that? Thank you in advance.
To accept anything in the format of facebook.com/username where username is alphanumeric with dots, dashes, and underscores (not sure what Facebook allows exactly):
// The dot is escaped so it only matches a literal "."
if (preg_match('%facebook\.com/([a-z0-9._-]+)%i', $input, $m))
{
    echo 'http://graph.facebook.com/', $m[1], "\n";
}
Why don't you just ask the user for their username? Instead of accepting a wide variety of input, design the form so that they only have to put in their username.
Something along the lines of this;
This way, you don't even have to validate or store anything other than their username. This could be super helpful down the road when Facebook moves fast and breaks stuff, like the URLs of users. Or if you want to form URLs for something other than graph API, you won't have to pull apart an existing URL.
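If you go the username-only route, the server side could be as small as the sketch below. The function name graphUrlFromUsername is hypothetical, and the allowed character set is borrowed from the regex answer above as an assumption, not Facebook's documented rule.

```php
<?php
// Hypothetical helper; the character whitelist is an assumption.
function graphUrlFromUsername(string $username): ?string
{
    $username = trim($username);
    // Reject anything that is not a plain username (e.g. a pasted full URL).
    if (!preg_match('/^[a-z0-9._-]+$/i', $username)) {
        return null;
    }
    return 'http://graph.facebook.com/' . $username;
}

var_dump(graphUrlFromUsername('zuck'));
var_dump(graphUrlFromUsername('http://facebook.com/zuck'));
```

The second call returns null because the colon and slashes are not in the whitelist, so you can show a validation message instead of storing a malformed value.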
If the inputs will always look like the ones you gave, I think the strstr function would handle this
$array = array('http://facebook.com/zuck', 'http://www.facebook.com/buck', 'www.facebook.com/luck', 'facebook.com/nuck');
foreach($array as $data)
{
if(strstr($data, 'facebook.com/'))
{
echo 'http://graph.'.strstr($data, 'facebook.com/') . '<br>';
}
}
This will output
http://graph.facebook.com/zuck
http://graph.facebook.com/buck
http://graph.facebook.com/luck
http://graph.facebook.com/nuck
Find the position of the last slash in the input (strrpos returns the position; strrchr would return a string):
$lastpos = strrpos($input, '/');
Then concatenate the Graph URL with everything from that last slash onward:
$new_url = 'http://graph.facebook.com' . substr($input, $lastpos);
$url = 'http://facebook.com/zuck';
$array = explode('/', str_replace('http://', '', $url));
$username = $array[1];
$finalurl = 'http://graph.facebook.com/'.$username;
echo $finalurl;
This will work with any format of input URL.
Something along the lines of:
Pattern:
(https?://)?(www\.)?(.+?)\/([^/]+)
Replace with:
http://graph.$3/$4
Test it here:
http://www.regexe.com/
I have this code right here:
// get host name from URL
preg_match('#^(?:http://)?([^/]+)#i',
"http://www.joomla.subdomain.php.net/index.html", $matches);
$host = $matches[1];
// get last two segments of host name
preg_match('/[^.]+\.[^.]+$/', $host, $matches);
echo "domain name is: {$matches[0]}\n";
The output will be php.net
I need just php without .net
Although regexes are fine here, I'd recommend parse_url
$host = parse_url('http://www.joomla.subdomain.php.net/index.html', PHP_URL_HOST);
$domains = explode('.', $host);
echo $domains[count($domains)-2];
This will work for TLDs like .com, .org, .net, etc., but not for .co.uk or .com.mx. You'd need some more logic (most likely an array of TLDs) to parse those out.
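A minimal sketch of that extra logic might look like this. The function name registrableName and the two-entry suffix list are assumptions for illustration; a real list of multi-part suffixes would be much longer.

```php
<?php
// Hypothetical helper; $multiPartTlds is a tiny illustrative list, not a complete one.
function registrableName(string $url, array $multiPartTlds = ['co.uk', 'com.mx']): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // no host component, e.g. the scheme is missing
    }
    $labels = explode('.', strtolower($host));
    $lastTwo = implode('.', array_slice($labels, -2));
    // Skip two labels for a known two-part suffix, otherwise one.
    $suffixLen = in_array($lastTwo, $multiPartTlds, true) ? 2 : 1;
    $index = count($labels) - $suffixLen - 1;
    return $index >= 0 ? $labels[$index] : null;
}

echo registrableName('http://www.joomla.subdomain.php.net/index.html'), "\n";
echo registrableName('http://www.example.co.uk/'), "\n";
```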
Group the first part of your 2nd regex into /([^.]+)\.[^.]+$/ and $matches[1] will be php
Late answer and it doesn't work with subdomains, but it does work with any tld (co.uk, com.de, etc):
$domain = "somesite.co.uk";
$domain_solo = explode(".", $domain)[0];
print($domain_solo);
It's really easy:
function get_tld($domain) {
$domain=str_replace("http://","",$domain); //remove http://
$domain=str_replace("www.","",$domain); //remove www. (including the dot, so no empty first label is left)
$nd=explode(".",$domain);
$domain_name=$nd[0];
$tld=str_replace($domain_name.".","",$domain);
return $tld;
}
To get the domain name instead, simply return $domain_name. This only works when the input has no subdomain; with a subdomain you will get the subdomain name.
Hacking up what I thought was the second simplest type of regex (extract a matching string from some strings, and use it) in php, but regex grouping seems to be tripping me up.
Objective
Take an ls listing of files and output the commands to format/copy the files to have the correct naming format.
Resize copies of the files to create thumbnails. (not even dealing with that step yet)
Failure
My code fails at the regex step: although I just want to filter out everything except a single regex group, the result always contains the group that I want and the text before it, even though I in no way requested the first capture group.
Here is a fully functioning, runnable version of the code on the online ide:
http://ideone.com/2RiqN
And here is the code (with a cut down initial dataset, although I don't expect that to matter at all):
<?php
// Long list of image names.
$file_data = <<<HEREDOC
07184_A.jpg
Adrian-Chelsea-C08752_A.jpg
Air-Adams-Cap-Toe-Oxford-C09167_A.jpg
Air-Adams-Split-Toe-Oxford-C09161_A.jpg
Air-Adams-Venetian-C09165_A.jpg
Air-Aiden-Casual-Camp-Moc-C09347_A.jpg
C05820_A.jpg
C06588_A.jpg
Air-Aiden-Classic-Bit-C09007_A.jpg
Work-Moc-Toe-Boot-C09095_A.jpg
HEREDOC;
if($file_data){
$files = preg_split("/[\s,]+/", $file_data);
// Split up the files based on the newlines.
}
$rename_candidates = array();
$i = 0;
foreach($files as $file){
$string = $file;
$pattern = '#(\w)(\d+)_A\.jpg$#i';
// Use the second regex group for the results.
$replacement = '$2';
// This should return only group 2 (any number of digits), but instead group 1 is somehow always in there.
$new_file_part = preg_replace($pattern, $replacement, $string);
// Example good end result: <img src="images/ch/ch-07184fs.jpg" width="350" border="0">
// Save the rename results for further processing later.
$rename_candidates[$i]=array('file'=>$file, 'new_file'=>$new_file_part);
// Rename the images into a standard format.
echo "cp ".$file." ./ch/ch-".$new_file_part."fs.jpg;";
// Echo out some commands for later.
echo "<br>";
$i++;
if($i>10){break;} // Just deal with the first 10 for now.
}
?>
Intended result for the regex: 788750
Intended result for the code output (multiple lines of): cp air-something-something-C485850_A.jpg ./ch/ch-485850.jpg;
What's wrong with my regex? Suggestions for simpler matching code would be appreciated as well.
Just a guess:
$pattern = '#^.*?(\w)(\d+)_A\.jpg$#i';
This includes the whole filename in the match. Otherwise preg_replace() will really only substitute the end of each string - it only applies the $replacement expression on the part that was actually matched.
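If the goal is only to pull out the digits, preg_match with a single capture group may be simpler than preg_replace, since nothing outside the group has to be matched and thrown away. This is a sketch using three of the filenames from the question; the $commands array is just an illustrative name.

```php
<?php
$files = ['07184_A.jpg', 'Adrian-Chelsea-C08752_A.jpg', 'Air-Adams-Venetian-C09165_A.jpg'];
$commands = [];
foreach ($files as $file) {
    // Capture only the run of digits directly before "_A.jpg".
    if (preg_match('/(\d+)_A\.jpg$/i', $file, $m)) {
        $commands[] = 'cp ' . $file . ' ./ch/ch-' . $m[1] . 'fs.jpg;';
    }
}
echo implode("\n", $commands), "\n";
```

Because the group is only \d+, the leading letter (the C in C08752) is never part of the captured value, and filenames that do not fit the pattern are skipped rather than passed through unchanged.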
Scan Dir and Explode
You know what? A simpler way to do it in php is to use scandir and explode combo
$dir = scandir('/path/to/directory');
foreach($dir as $file)
{
$ext = pathinfo($file,PATHINFO_EXTENSION);
if($ext!='jpg') continue;
$a = explode('-',$file); //grab the end of the string after the -
$newfilename = end($a); //if there is no dash just take the whole string
$newlocation = './ch/ch-'.str_replace(array('C','_A'),'', basename($newfilename,'.jpg')).'fs.jpg';
echo "#copy($file, $newlocation)\n";
}
#and you are done :)
explode: basically a filename like blah-2.jpg is turned into an array('blah','2.jpg'); taking the end() of that gets the last element. It's almost the same as array_pop();
Working Example
Here's my ideone code http://ideone.com/gLSxA
I'm trying to get a user's ID from a string such as:
http://www.abcxyz.com/123456789/
It should come out as 123456789, essentially stripping everything up to the first / after the domain and also removing the trailing /. I did have a look around on the net; there are many solutions, but none that handle both the start and the end.
Thanks :)
Update 1
The link can take two forms: mod_rewrite as above and also "http://www.abcxyz.com/profile?user_id=123456789"
I would use parse_url() to cleanly extract the path component from the URL:
$path = parse_url("http://www.example.com/123456789/", PHP_URL_PATH);
and then split the path into its elements using explode():
$path = trim($path, "/"); // Remove starting and trailing slashes
$path_exploded = explode("/", $path);
and then output the first component of the path:
echo $path_exploded[0]; // Will output 123456789
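This method also works in edge cases like
http://www.example.com/123456789?test
http://www.example.com//123456789
and even
/123456789/abcdef
A scheme-less input such as www.example.com/123456789/abcdef is different: parse_url() treats the whole string as the path, so prepend a scheme (for example http://) before parsing it.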
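To cover the second form from the update (profile?user_id=123456789) as well, the same idea can be extended with parse_str(). This is a sketch only: the function name extractUserId is hypothetical, and it assumes the ID is either the user_id query parameter or the first path segment.

```php
<?php
// Hypothetical helper covering both URL forms from the question.
function extractUserId(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false) {
        return null; // seriously malformed URL
    }
    // Form 2: http://www.abcxyz.com/profile?user_id=123456789
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
        $id = $query['user_id'] ?? null;
        if (is_string($id) && ctype_digit($id)) {
            return $id;
        }
    }
    // Form 1: http://www.abcxyz.com/123456789/
    $segments = explode('/', trim($parts['path'] ?? '', '/'));
    return ctype_digit($segments[0]) ? $segments[0] : null;
}

echo extractUserId('http://www.abcxyz.com/123456789/'), "\n";
echo extractUserId('http://www.abcxyz.com/profile?user_id=123456789'), "\n";
```

Checking the query string first means a path like /profile never gets mistaken for an ID, and ctype_digit keeps non-numeric segments out of the result.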
$string = 'http://www.abcxyz.com/123456789/';
$parts = array_filter(explode('/', $string));
$id = array_pop($parts);
This works if the ID is always the last member of the URL.
$url="http://www.abcxyz.com/123456789/";
$id=preg_replace(",.*/([0-9]+)/$,","\\1",$url);
echo $id;
If there are no other digits (or plus/minus signs) in the URL, you can also do
echo filter_var('http://www.abcxyz.com/123456789/', FILTER_SANITIZE_NUMBER_INT);
to strip out everything that is not a digit or a sign character.
That might be somewhat quicker than using the parse_url+parse_str combination.
If your domain does not contain any numbers, you can handle both situations (with or without user_id) using:
<?php
$string1 = 'http://www.abcxyz.com/123456789/';
$string2 = 'http://www.abcxyz.com/profile?user_id=123456789';
preg_match('/[0-9]+/',$string1,$matches);
print_r($matches[0]);
preg_match('/[0-9]+/',$string2,$matches);
print_r($matches[0]);
?>