remove subdomain from url PHP - php

I'm cleaning URLs from a text file with a PHP script. Here's the code so far:
$file = __DIR__."/url.txt";
$f = fopen($file, "r");
$array1 = array();
while ( $line = fgets($f, 1000) ) {
$nl = mb_strtolower($line,'UTF-8');
$array1[] = $nl;
}
foreach ($array1 as $value) {
$value = preg_replace('#^https?://#', '', $value);
$value = preg_replace('#^www\.#', '', $value);
echo $value."<br>";
}
So I remove the http:// (or https://) and the www. prefix from these URLs.
Here's the output :
urlnumberone.com
urlnumbertwo.uk
subdomain.urlnumberthree.com
urlnumberfour.com
What I want is to remove subdomain too, and just have urlnumberthree.com
Thanks for your help !

Pure regex solution:
preg_match('#[^\.]+[\.]{1}[^\.]+$#', $value , $matches);
$value = $matches[0];
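This replaces both of your preg_replaces. Note that it keeps only the last two dot-separated parts, so a host such as example.co.uk would be reduced to co.uk.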

Count the occurrences of '.' in the string; if there is more than one, remove everything from the beginning up to and including the first dot.
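The dot-counting idea can be sketched roughly as follows. This is not the answerer's code: the helper name stripSubdomain is hypothetical, and it keeps the last two labels (rather than dropping only the first one) on the assumption that the suffix is a single label such as .com or .uk. Hosts like example.co.uk would need a public-suffix list instead.

```php
<?php
// Hypothetical helper: keep only the last two dot-separated labels of a host.
// Assumption: the suffix is a single label (.com, .uk), not something like .co.uk.
function stripSubdomain(string $host): string
{
    $host = trim($host);               // fgets() leaves the trailing newline in place
    $labels = explode('.', $host);
    if (count($labels) <= 2) {
        return $host;                  // nothing to strip
    }
    return implode('.', array_slice($labels, -2));
}

echo stripSubdomain("subdomain.urlnumberthree.com\n"), PHP_EOL;
echo stripSubdomain('urlnumbertwo.uk'), PHP_EOL;
```

In the question's loop this would be called after the two preg_replace() calls.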

Related

Get value from file - php

Let's say I have this in my text file:
Author:MJMZ
Author URL:http://abc.co
Version: 1.0
How can I get the string "MJMZ" if I look for the string "Author"?
I already tried the solution from another question (Php get value from text file) but with no success.
The problem may be the strpos function. In my case, the word "Author" appears twice ("Author" and "Author URL"), so strpos alone can't solve my problem.
Split each line at the : using explode, then check if the prefix matches what you're searching for:
$lines = file($filename, FILE_IGNORE_NEW_LINES);
foreach($lines as $line) {
list($prefix, $data) = explode(':', $line, 2); // limit 2: the value may itself contain ':'
if (trim($prefix) == "Author") {
echo $data;
break;
}
}
Try the following:
$file_contents = file_get_contents('myfilename.ext');
// The m modifier lets ^ match at the start of every line, not only at the start of the file
preg_match('/^Author\s*\:\s*([^\r\n]+)/m', $file_contents, $matches);
$code = isset($matches[1]) && !empty($matches[1]) ? $matches[1] : 'no-code-found';
echo $code;
Now $code should contain MJMZ ($matches[1] holds the captured value).
The above will search for the first instance of Author:CODE_HERE in your file and place CODE_HERE in $matches[1].
More specifically, the regex searches for a string that starts with the word Author, followed by optional whitespace \s*, followed by a colon \:, followed by optional whitespace \s*, followed by one or more characters that are not a line break [^\r\n]+.
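To illustrate why this pattern does not trip over the "Author URL" line, here is a small sketch. The function name findAuthor and the sample text are made up for the example, and the m modifier is added here so that ^ matches at the start of each line; without it the pattern only matches when Author is the very first line of the file.

```php
<?php
// Hypothetical helper: return the value of the "Author" line, or null if there is none.
function findAuthor(string $text): ?string
{
    // After "Author" only whitespace may precede the colon,
    // so "Author URL:..." does not match.
    if (preg_match('/^Author\s*:\s*([^\r\n]+)/m', $text, $m)) {
        return trim($m[1]);
    }
    return null;
}

// The Author line is deliberately not the first line here.
$text = "Version: 1.0\nAuthor URL:http://abc.co\nAuthor:MJMZ\n";
echo findAuthor($text), PHP_EOL;
```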
If your file will have dynamically added items, you can read them all into an array:
$content = file_get_contents("myfile.txt");
$line = explode("\n", $content);
$item = array();
foreach($line as $l){
    // Split at the first ":" only, so values such as URLs keep their own colons
    $var = explode(":", $l, 2);
    if (count($var) < 2) continue; // skip lines without a ":"
    $item[trim($var[0])] = trim($var[1]);
}
// Now you can access every single item by its name:
print $item["Author"];
The limit of 2 passed to explode() is needed so that values can contain ":" themselves. The program separates the name from the value at the first ":".
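As a quick check of the "split at the first colon" behaviour, here is a self-contained sketch that works on a string instead of a file. The function name parseKeyValueLines is hypothetical, and it uses explode() with a limit of 2 to keep any further colons inside the value.

```php
<?php
// Hypothetical helper: turn "name:value" lines into an associative array.
function parseKeyValueLines(string $content): array
{
    $items = [];
    foreach (preg_split('/\R/', $content) as $line) {
        if (strpos($line, ':') === false) {
            continue;                                // ignore lines without a separator
        }
        [$name, $value] = explode(':', $line, 2);    // split at the first ":" only
        $items[trim($name)] = trim($value);
    }
    return $items;
}

$items = parseKeyValueLines("Author:MJMZ\nAuthor URL:http://abc.co\nVersion: 1.0");
echo $items['Author'], PHP_EOL;
echo $items['Author URL'], PHP_EOL;
```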
First read the lines from the file into an array, then look values up by their keys.
$array = array();
$handle = fopen("file.txt", "r");
if ($handle) {
    while (($line = fgets($handle)) !== false) {
        $pieces = explode(":", $line, 2);
        if (count($pieces) === 2) {
            $array[trim($pieces[0])] = trim($pieces[1]);
        }
    }
    fclose($handle); // only close a handle that was actually opened
} else {
    // error opening the file.
}
echo $array['Author'];

How to Remove Duplicate Domains from a Large List of URLs

I want to remove duplicate domains from a list of URLs. For example, below is the text file:
http://www.exampleurl.com/something.php
http://www.domain.com/something.php
http://www.exampleurl.com/something111.php
http://www.exampleurl.com/something111.php
http://www.exampleurl.com/something222.php
I need to remove the duplicate domains so that I get the list below:
http://www.exampleurl.com/something.php
http://www.domain.com/something.php
Below is the code that just removes duplicate lines from a text file.
$text = array_unique(file($filename));
$f = @fopen("promo1.txt",'w+');
if ($f) {
fputs($f, join('',$text));
fclose($f);
}
?>
Can anyone help me ?
$urls = file('domains.txt');
$uniqueDomains = array_reduce (
$urls,
function (array $list, $url) {
$domain = parse_url(trim($url), PHP_URL_HOST);
if (!isset($list[$domain])) $list[$domain] = $url;
return $list;
},
array()
);
$uniqueDomains has the hostname as key. If you don't need (and/or want) it use array_values($uniqueDomains);
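The same keep-the-first-URL-per-host idea, written as a plain loop and run against the list from the question. The function name uniqueByHost is hypothetical, and lower-casing the host is an assumption added here so that www.Domain.com and www.domain.com count as the same host.

```php
<?php
// Hypothetical helper: keep the first URL seen for each host.
function uniqueByHost(array $urls): array
{
    $seen = [];
    foreach ($urls as $url) {
        $url = trim($url);                          // file() keeps line endings
        $host = parse_url($url, PHP_URL_HOST);
        if (!is_string($host)) {
            continue;                               // blank or unparseable line
        }
        $host = strtolower($host);
        if (!isset($seen[$host])) {
            $seen[$host] = $url;
        }
    }
    return $seen;                                   // host => first URL
}

$urls = [
    "http://www.exampleurl.com/something.php\n",
    "http://www.domain.com/something.php\n",
    "http://www.exampleurl.com/something111.php\n",
    "http://www.exampleurl.com/something222.php\n",
];
echo implode("\n", array_values(uniqueByHost($urls))), PHP_EOL;
```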
To compare on the domains, you can use parse_url:
<?php
$text = file_get_contents("input.txt");
$lines = explode("\n",$text);
$filtered_domains = array();
foreach($lines as $line)
{
$parsed_url = parse_url($line);
if(array_search($parsed_url['host'], $filtered_domains) === false)
{
$filtered_domains[$line] = $parsed_url['host'];
}
}
$output = implode("\n", array_keys($filtered_domains));
file_put_contents("output.txt", $output);
?>
<?php
/*
$lines = file('textfile.txt');
*/
$lines = array(
'http://www.exampleurl.com/something.php',
'http://www.domain.com/something.php',
'http://www.exampleurl.com/something111.php',
'http://www.exampleurl.com/something111.php',
'http://www.exampleurl.com/something222.php'
);
foreach($lines as $line){
$url_parsed = parse_url($line);
if(is_array($url_parsed)){
$host = $url_parsed['host'];
if(!isset($uniques[$host])){
$uniques[$host] = $line;
}
}
}
echo join('',$uniques);
$f = @fopen("promo1.txt",'w+');
if ($f) {
fputs($f, join("\n",$uniques));
fclose($f);
}
?>
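To remove duplicates from an array you can use array_unique(). To make your list an array you can use explode(). Then to make it a string again you can use implode(). Note that this removes only exact duplicate lines; URLs that share a domain but differ in path are kept.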
To put this all together you can use the following code:
$list = "http://www.exampleurl.com/something.php
http://www.domain.com/something.php
http://www.exampleurl.com/something111.php
http://www.exampleurl.com/something111.php
http://www.exampleurl.com/something222.php";
$newList = implode("\n", array_unique(explode("\n", $list)));

Parse a txt file with tags in PHP

I have a .txt file that is like this:
Title: Test
Author: zad0xsis
Date: July 13th, 2011
Body: This is a test post and this can continue until the file end
How could I make PHP recognize the "tags" and put the content of each into a new string? Thanks in advance! :D
$fc = file('some_file.txt'); // read file into array
foreach ($fc as $line) {
list($tag, $content) = explode(':', $line, 2);
// do something here
}
Now, are there multiple unrelated sets in each file? If so, you'll have to look for some marker, maybe a new line, and do a reset. Hopefully you can figure this part out on your own.
Some functions for you to check out:
file
file_get_contents
explode
list (not really a function)
Edit: slightly expanding the example:
$fc = file('some_file.txt'); // read file into array
foreach ($fc as $index => $line) {
list($tag, $content) = explode(':', $line, 2);
// do something here
if ('body' == strtolower($tag)) {
$content .= join('', array_slice($fc, $index + 1)); // append the remaining lines to the first body line
break;
}
}
More functions for you!
strtolower
join (aka implode)
array_slice
trim - this is not used in my solution, but you may want to use it to trim the newline chars from the end of the lines as returned by file(). Alternatively, you can use the FILE_IGNORE_NEW_LINES flag when calling file(); more information on that can be found in the PHP Manual entry for file().
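A rough sketch of the FILE_IGNORE_NEW_LINES variant mentioned in the last item. The temporary file, the two-line body and the $post array are assumptions made so the example runs on its own; array_pad() is added so that a line without a colon does not raise a warning.

```php
<?php
// Write a sample post to a temporary file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'post');
file_put_contents($path, "Title: Test\nAuthor: zad0xsis\nDate: July 13th, 2011\nBody: first line\nsecond line\n");

$post = [];
$lines = file($path, FILE_IGNORE_NEW_LINES);        // no trailing newlines to trim
foreach ($lines as $index => $line) {
    [$tag, $content] = array_pad(explode(':', $line, 2), 2, '');
    $tag = strtolower(trim($tag));
    if ($tag === 'body') {
        // The body runs to the end of the file: join the rest of the lines back on.
        $rest = array_slice($lines, $index + 1);
        $post['body'] = implode("\n", array_merge([trim($content)], $rest));
        break;
    }
    $post[$tag] = trim($content);
}
unlink($path);

print_r($post);
```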
Another solution:
<?php
//$sample = file_get_contents('myfile.txt'); // read from file
$sample = "Title: Test
Author: zad0xsis
Date: July 13th, 2011
Body: This is a test post and this can continue until the file end";
$re = '/^(?<tag>\w+):\s?(?<content>.*)$/m';
$matches = null;
if (preg_match_all($re, $sample, $matches))
{
for ($_ = 0; $_ < count($matches['tag']); $_++)
printf("TAG: %s\r\nCONTENT: %s\r\n\r\n", $matches['tag'][$_], $matches['content'][$_]);
}
produces:
TAG: Title
CONTENT: Test
TAG: Author
CONTENT: zad0xsis
TAG: Date
CONTENT: July 13th, 2011
TAG: Body
CONTENT: This is a test post and this can continue until the file end
I used named groups just on general principle. Also, if need be, you can replace the (?<tag>\w+) with something looser such as (?<tag>.*?) if the tags could contain spaces or other non-word characters.
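To show what the looser (?<tag>.*?) variant buys you, here is a short sketch with tags that contain spaces. The sample lines and the $pairs array are made up for the example.

```php
<?php
// Tags with spaces would not match \w+, but do match the lazy .*? variant.
$sample = "Author URL: http://abc.co\nPost 2: second entry";
$re = '/^(?<tag>.*?):\s?(?<content>.*)$/m';

$pairs = [];
if (preg_match_all($re, $sample, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // The lazy group stops at the first ":", so the URL's own colon stays in the content.
        $pairs[$match['tag']] = $match['content'];
    }
}
print_r($pairs);
```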
$file = file("file.txt");
foreach($file as $line)
{
preg_match("|^(.*?): (.*)$|", $line, $match); // anchored, so the second group captures the whole value
$tag = $match[1];
$content = $match[2];
}
<?php
$tagValue = array();
$file = fopen("welcome.txt", "r") or exit("Unable to open file!");
while (($line = fgets($file)) !== false)
{
$tagDelimiter = strpos($line, ":");
if ($tagDelimiter === false) continue; // skip lines without a tag
$tag = substr($line, 0, $tagDelimiter);
$value = trim(substr($line, $tagDelimiter + 1));
$tagValue[$tag] = $value;
}
fclose($file);
?>
You can then access your data with $tagValue["Title"].
You can do this:
$file = file('file.txt');
foreach($file as $line)
{
if(preg_match('/^(.*?)\s*:\s*(.*)$/', $line, $match))
{
$tag = $match[1];
$value = $match[2];
}
}
Use strpos() and substr():
function parse($filename)
{
$lines = file($filename);
$content = array();
foreach ($lines as $line)
{
$posColon = strpos($line, ":");
$tag = substr($line, 0, $posColon);
$body = substr($line, $posColon+1);
$content[$tag] = trim($body);
}
return $content;
}

Function for each subfolder in PHP

I am new to PHP and can't figure out how to do this:
$link = 'http://www.domainname.com/folder1/folder2/folder3/folder4';
$domain_and_slash = 'http://www.domainname.com' . '/';
$address_without_site_url = str_replace($domain_and_slash, '', $link);
foreach ($folder_adress) {
// function here for example
echo $folder_adress;
}
I can't figure out how to get the $folder_adress.
In the case above I want the function to echo these four:
folder1
folder1/folder2
folder1/folder2/folder3
folder1/folder2/folder3/folder4
The $link will have different amount of subfolders...
This gets you there. Some things you might explore further: explode, parse_url, trim. Taking a look at the docs for these functions will give you a better understanding of how to handle URLs and how the code below works.
$link = 'http://www.domainname.com/folder1/folder2/folder3/folder4';
$parts = parse_url($link);
$pathParts = explode('/', trim($parts['path'], '/'));
$buffer = "";
foreach ($pathParts as $part) {
$buffer .= $part.'/';
echo $buffer . PHP_EOL;
}
/*
Output:
folder1/
folder1/folder2/
folder1/folder2/folder3/
folder1/folder2/folder3/folder4/
*/
You should have a look at the explode() function:
array explode ( string $delimiter , string $string [, int $limit ] )
Returns an array of strings, each of which is a substring of string formed by splitting it on boundaries formed by the string delimiter.
Use / as the delimiter.
This is what you are looking for:
$link = 'http://www.domainname.com/folder1/folder2/folder3/folder4';
$domain_and_slash = 'http://www.domainname.com' . '/';
$address_without_site_url = str_replace($domain_and_slash, '', $link);
// this splits the string into an array
$address_without_site_url_array = explode('/', $address_without_site_url);
$folder_adress = '';
// now we loop through the array we have and append each item to the string $folder_adress
foreach ($address_without_site_url_array as $item) {
// function here for example
$folder_adress .= $item.'/';
echo $folder_adress;
}
Hope that helps.
Try this:
$parts = explode("/", "folder1/folder2/folder3/folder4");
$base = "";
for($i=0;$i<count($parts);$i++){
$base .= ($base ? "/" : "") . $parts[$i];
echo $base . "<br/>";
}
I would use preg_match() for a regular-expression approach:
preg_match('%^http://([^/]+)/([^/]+)/([^/]+)/([^/]+)/([^/]+)/?$%', $link, $m);
// $m[1]: domain.ext
// $m[2]: folder1
// $m[3]: folder2
// $m[4]: folder3
// $m[5]: folder4
1) List approach: use explode (split() was removed in PHP 7) to get an array of folders, then concatenate them in a loop.
2) String approach: use strpos with an offset parameter that advances from 0 to 1 + the last position where a slash was found, then use substr to extract that part of the folder string.
EDIT:
<?php
$folders = 'folder1/folder2/folder3/folder4';
// Stand-in for whatever should run per folder ("fn" is a reserved word since PHP 7.4)
function show($folder) {
    echo $folder, "\n";
}
echo "\narray approach\n";
$folder_array = explode('/', $folders); // split() was removed in PHP 7
$result = '';
foreach ($folder_array as $folder) {
    if ($result != '')
        $result .= '/';
    $result .= $folder;
    show($result);
}
echo "\nstring approach\n";
$pos = 0;
while ($pos = strpos($folders, '/', $pos)) {
    show(substr($folders, 0, $pos++));
}
show($folders);
?>
If I had time, I could do a cleaner job. But this works and gets across some ideas: http://codepad.org/ITJVCccT
Use parse_url, trim, explode, array_pop, and implode
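One way those functions could fit together, as a sketch only. The function name cumulativePaths is hypothetical, and array_unshift() is used in addition to the functions listed so that the shortest path comes first.

```php
<?php
// Hypothetical helper: list every cumulative folder path of a URL, shortest first.
function cumulativePaths(string $link): array
{
    $path = trim((string) parse_url($link, PHP_URL_PATH), '/');
    $parts = $path === '' ? [] : explode('/', $path);

    $paths = [];
    while ($parts) {
        array_unshift($paths, implode('/', $parts)); // prepend the current (longer) path
        array_pop($parts);                           // then drop the last folder
    }
    return $paths;
}

foreach (cumulativePaths('http://www.domainname.com/folder1/folder2/folder3/folder4') as $folder) {
    echo $folder, PHP_EOL;
}
```

Because the paths are collected in an array first, the per-folder function from the question can simply be called inside the final foreach.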

PHP Remove URL from string

If I have a string that contains a url (for examples sake, we'll call it $url) such as;
$url = "Here is a funny site http://www.tunyurl.com/34934";
How do I remove the URL from the string?
The difficulty is that URLs might also show up without the http://, such as:
$url = "Here is another funny site www.tinyurl.com/55555";
There is no HTML present. How would I start a search if http or www exists, then remove the text/numbers/symbols until the first space?
I re-read the question, here is a function that would work as intended:
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,'http') || (count(explode('.',$u)) > 1)) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
return implode(' ',$U);
}
$url = "Here is another funny site www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);
Edit #2/#3 (I must be bored). Here is a version that verifies there is a TLD within the URL:
function containsTLD($string) {
preg_match(
"/(AC($|\/)|\.AD($|\/)|\.AE($|\/)|\.AERO($|\/)|\.AF($|\/)|\.AG($|\/)|\.AI($|\/)|\.AL($|\/)|\.AM($|\/)|\.AN($|\/)|\.AO($|\/)|\.AQ($|\/)|\.AR($|\/)|\.ARPA($|\/)|\.AS($|\/)|\.ASIA($|\/)|\.AT($|\/)|\.AU($|\/)|\.AW($|\/)|\.AX($|\/)|\.AZ($|\/)|\.BA($|\/)|\.BB($|\/)|\.BD($|\/)|\.BE($|\/)|\.BF($|\/)|\.BG($|\/)|\.BH($|\/)|\.BI($|\/)|\.BIZ($|\/)|\.BJ($|\/)|\.BM($|\/)|\.BN($|\/)|\.BO($|\/)|\.BR($|\/)|\.BS($|\/)|\.BT($|\/)|\.BV($|\/)|\.BW($|\/)|\.BY($|\/)|\.BZ($|\/)|\.CA($|\/)|\.CAT($|\/)|\.CC($|\/)|\.CD($|\/)|\.CF($|\/)|\.CG($|\/)|\.CH($|\/)|\.CI($|\/)|\.CK($|\/)|\.CL($|\/)|\.CM($|\/)|\.CN($|\/)|\.CO($|\/)|\.COM($|\/)|\.COOP($|\/)|\.CR($|\/)|\.CU($|\/)|\.CV($|\/)|\.CX($|\/)|\.CY($|\/)|\.CZ($|\/)|\.DE($|\/)|\.DJ($|\/)|\.DK($|\/)|\.DM($|\/)|\.DO($|\/)|\.DZ($|\/)|\.EC($|\/)|\.EDU($|\/)|\.EE($|\/)|\.EG($|\/)|\.ER($|\/)|\.ES($|\/)|\.ET($|\/)|\.EU($|\/)|\.FI($|\/)|\.FJ($|\/)|\.FK($|\/)|\.FM($|\/)|\.FO($|\/)|\.FR($|\/)|\.GA($|\/)|\.GB($|\/)|\.GD($|\/)|\.GE($|\/)|\.GF($|\/)|\.GG($|\/)|\.GH($|\/)|\.GI($|\/)|\.GL($|\/)|\.GM($|\/)|\.GN($|\/)|\.GOV($|\/)|\.GP($|\/)|\.GQ($|\/)|\.GR($|\/)|\.GS($|\/)|\.GT($|\/)|\.GU($|\/)|\.GW($|\/)|\.GY($|\/)|\.HK($|\/)|\.HM($|\/)|\.HN($|\/)|\.HR($|\/)|\.HT($|\/)|\.HU($|\/)|\.ID($|\/)|\.IE($|\/)|\.IL($|\/)|\.IM($|\/)|\.IN($|\/)|\.INFO($|\/)|\.INT($|\/)|\.IO($|\/)|\.IQ($|\/)|\.IR($|\/)|\.IS($|\/)|\.IT($|\/)|\.JE($|\/)|\.JM($|\/)|\.JO($|\/)|\.JOBS($|\/)|\.JP($|\/)|\.KE($|\/)|\.KG($|\/)|\.KH($|\/)|\.KI($|\/)|\.KM($|\/)|\.KN($|\/)|\.KP($|\/)|\.KR($|\/)|\.KW($|\/)|\.KY($|\/)|\.KZ($|\/)|\.LA($|\/)|\.LB($|\/)|\.LC($|\/)|\.LI($|\/)|\.LK($|\/)|\.LR($|\/)|\.LS($|\/)|\.LT($|\/)|\.LU($|\/)|\.LV($|\/)|\.LY($|\/)|\.MA($|\/)|\.MC($|\/)|\.MD($|\/)|\.ME($|\/)|\.MG($|\/)|\.MH($|\/)|\.MIL($|\/)|\.MK($|\/)|\.ML($|\/)|\.MM($|\/)|\.MN($|\/)|\.MO($|\/)|\.MOBI($|\/)|\.MP($|\/)|\.MQ($|\/)|\.MR($|\/)|\.MS($|\/)|\.MT($|\/)|\.MU($|\/)|\.MUSEUM($|\/)|\.MV($|\/)|\.MW($|\/)|\.MX($|\/)|\.MY($|\/)|\.MZ($|\/)|\.NA($|\/)|\.NAME($|\/)|\.NC($|\/)|\.NE($|\/)|\.NET($|\/)|\.NF($|\/)|\.NG($|\/)|\.
NI($|\/)|\.NL($|\/)|\.NO($|\/)|\.NP($|\/)|\.NR($|\/)|\.NU($|\/)|\.NZ($|\/)|\.OM($|\/)|\.ORG($|\/)|\.PA($|\/)|\.PE($|\/)|\.PF($|\/)|\.PG($|\/)|\.PH($|\/)|\.PK($|\/)|\.PL($|\/)|\.PM($|\/)|\.PN($|\/)|\.PR($|\/)|\.PRO($|\/)|\.PS($|\/)|\.PT($|\/)|\.PW($|\/)|\.PY($|\/)|\.QA($|\/)|\.RE($|\/)|\.RO($|\/)|\.RS($|\/)|\.RU($|\/)|\.RW($|\/)|\.SA($|\/)|\.SB($|\/)|\.SC($|\/)|\.SD($|\/)|\.SE($|\/)|\.SG($|\/)|\.SH($|\/)|\.SI($|\/)|\.SJ($|\/)|\.SK($|\/)|\.SL($|\/)|\.SM($|\/)|\.SN($|\/)|\.SO($|\/)|\.SR($|\/)|\.ST($|\/)|\.SU($|\/)|\.SV($|\/)|\.SY($|\/)|\.SZ($|\/)|\.TC($|\/)|\.TD($|\/)|\.TEL($|\/)|\.TF($|\/)|\.TG($|\/)|\.TH($|\/)|\.TJ($|\/)|\.TK($|\/)|\.TL($|\/)|\.TM($|\/)|\.TN($|\/)|\.TO($|\/)|\.TP($|\/)|\.TR($|\/)|\.TRAVEL($|\/)|\.TT($|\/)|\.TV($|\/)|\.TW($|\/)|\.TZ($|\/)|\.UA($|\/)|\.UG($|\/)|\.UK($|\/)|\.US($|\/)|\.UY($|\/)|\.UZ($|\/)|\.VA($|\/)|\.VC($|\/)|\.VE($|\/)|\.VG($|\/)|\.VI($|\/)|\.VN($|\/)|\.VU($|\/)|\.WF($|\/)|\.WS($|\/)|\.XN--0ZWM56D($|\/)|\.XN--11B5BS3A9AJ6G($|\/)|\.XN--80AKHBYKNJ4F($|\/)|\.XN--9T4B11YI5A($|\/)|\.XN--DEBA0AD($|\/)|\.XN--G6W251D($|\/)|\.XN--HGBK6AJ7F53BBA($|\/)|\.XN--HLCJ6AYA9ESC7A($|\/)|\.XN--JXALPDLP($|\/)|\.XN--KGBECHTV($|\/)|\.XN--ZCKZAH($|\/)|\.YE($|\/)|\.YT($|\/)|\.YU($|\/)|\.ZA($|\/)|\.ZM($|\/)|\.ZW)/i",
$string,
$M);
$has_tld = (count($M) > 0) ? true : false;
return $has_tld;
}
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,".")) { //only preg_match if there is a dot
if (containsTLD($u) === true) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
}
return implode(' ',$U);
}
$url = "Here is another funny site badurl.badone somesite.ca/worse.jpg but this badsite.com www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);
returns:
Cleaned: Here is another funny site badurl.badone but this and and
$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $string);
Parsing text for URLs is hard and looking for pre-existing, heavily tested code that already does this for you would be better than writing your own code and missing edge cases. For example, I would take a look at the process in Django's urlize, which wraps URLs in anchors. You could port it over to PHP, and--instead of wrapping URLs in an anchor--just delete them from the text.
Thanks, Mike. I updated it slightly because the original returned a notice. The pattern:
'/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i'
and the full call:
$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i', '', $string);
$url = "Here is a funny site http://www.tunyurl.com/34934";
$replace = 'http www .com .org .net';
$with = '';
$clean_url = clean($url,$replace,$with);
echo $clean_url;
function clean($url,$replace,$with) {
$replace = explode(" ",$replace);
$new_string = '';
$check = explode(" ",$url);
foreach($check AS $key => $value) {
foreach($replace AS $key2 => $value2 ) {
if (strpos( strtolower($value), strtolower($value2) ) !== false) {
$value = $with;
break;
}
}
$new_string .= " ".$value;
}
return $new_string;
}
You would need to write a regular expression to extract out the urls.
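A minimal sketch of such a regular expression, under the question's own rule: anything that starts with http://, https:// or www. counts as a URL and runs until the next whitespace. The function name removeUrls is hypothetical, and bare hosts such as img.hostingsite.com are deliberately not handled.

```php
<?php
// Hypothetical helper: strip tokens that start with http(s):// or www.
function removeUrls(string $text): string
{
    // \S+ consumes everything up to the next whitespace character.
    $stripped = preg_replace('~\b(?:https?://|www\.)\S+~i', '', $text);
    // Collapse the gaps left behind and trim the ends.
    return trim(preg_replace('/\s{2,}/', ' ', $stripped));
}

echo removeUrls("Here is a funny site http://www.tunyurl.com/34934"), PHP_EOL;
echo removeUrls("Here is another funny site www.tinyurl.com/55555 and more"), PHP_EOL;
```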
