PHP array unique for URLs

I need to identify unique urls from an array.
All of the following variants should count as equal:
http://google.com
https://google.com
http://www.google.com
https://www.google.com
www.google.com
google.com
I have the following solution:
public static function array_unique_url(array $array) : array
{
$uniqueArray = [];
foreach($array as $item) {
if(!self::in_array_url($item, $uniqueArray)){
$uniqueArray[] = $item;
}
}
return $uniqueArray;
}
public static function in_array_url(string $needle, array $haystack): bool {
$haystack = array_map([self::class, 'normalizeUrl'], $haystack);
$needle = self::normalizeUrl($needle);
return in_array($needle, $haystack);
}
public static function normalizeUrl(string $url) {
$url = strtolower($url);
return preg_replace('#^(https?://)?(www\.)?#', '', $url); // escape the dot so only a literal "www." is stripped
}
However, this is not very efficient: it is O(n^2), because every in_array_url call re-normalizes and scans the whole result array. Can anybody point me to a better solution?

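in_array is expensive because it scans the whole array on every call. Instead, create a hash (an associative array) keyed by the normalized URL and store a count or flag as the value.
Something like:
$myHash = []; // holds the normalized URLs seen so far, as keys
And while checking, do this:
if (isset($myHash[$needle])) {
    // already exists
} else {
    $myHash[$needle] = 1;
}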
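As a rough sketch of the hash idea applied to the whole problem: the function name uniqueUrlsKeepFirst is hypothetical, and the normalization regex simply mirrors the one from the question. It keeps the first original URL seen for each normalized form.

```php
<?php
// Hypothetical helper (not from the question): de-duplicates in a single pass.
function uniqueUrlsKeepFirst(array $urls): array
{
    $seen = []; // normalized URL => first original URL seen
    foreach ($urls as $url) {
        // Same normalization as the question: lowercase, drop scheme and leading "www."
        $key = preg_replace('#^(https?://)?(www\.)?#', '', strtolower($url));
        if (!isset($seen[$key])) { // key lookup is O(1) on average, unlike in_array
            $seen[$key] = $url;
        }
    }
    return array_values($seen);
}

$result = uniqueUrlsKeepFirst([
    'http://google.com',
    'www.google.com',
    'https://example.org',
    'Example.org',
]);
print_r($result);
```

Because each URL is normalized once and looked up by key, the loop is linear in the number of URLs.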

I haven't tested it, but maybe something like this will work:
function getUniqueUrls(array $urls)
{
$unique_urls = [];
foreach ($urls as $url) {
$normalized_url = preg_replace('#^(https?://)?(www\.)?#', '', strtolower($url));
$unique_urls[$normalized_url] = true;
}
return array_keys($unique_urls);
}
$arr = [
'http://google.com',
'https://google.com',
'http://www.google.com',
'https://www.google.com',
'www.google.com',
'google.com'
];
$unique_urls = getUniqueUrls($arr);

Here is a simplified version. It avoids preg_replace, which is comparatively expensive, and it does not perform any unnecessary string operations.
$urls = array(
"http://google.com",
"https://google.com",
"http://www.google.com",
"https://www.google.com",
"www.google.com",
"google.com"
);
$uniqueUrls = array();
foreach($urls as $url) {
$subPos = 0;
if(($pos = stripos($url, "://")) !== false) {
$subPos = $pos + 3;
}
if(stripos($url, "www.", $subPos) === $subPos) { // only strip "www." when it directly follows the scheme
$subPos += 4;
}
$subStr = strtolower(substr($url, $subPos));
if(!in_array($subStr, $uniqueUrls)) {
$uniqueUrls[] = $subStr;
}
}
var_dump($uniqueUrls);
Another performance optimization would be to avoid the linear in_array scan, which has to search the whole array because it is not sorted: either keep the unique URLs sorted and use binary search, or store them as array keys and check with isset.

<?php
$urls = [
'http://google.com',
'https://google.com',
'http://www.google.com',
'https://www.google.com',
'www.google.com',
'google.com',
'testing.com:9200'
];
$uniqueUrls = [];
foreach ($urls as $url) {
$urlData = parse_url($url);
$urlHostName = array_key_exists('host',$urlData) ? $urlData['host'] : $urlData['path'];
$host = preg_replace('/^www\./i', '', $urlHostName); // strip only a leading "www.", not one in the middle
if(!in_array($host, $uniqueUrls) && $host != ''){
array_push($uniqueUrls, $host);
}
}
print_r($uniqueUrls);
?>
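One detail worth knowing about the parse_url() approach: it only reports a host when the input has a scheme (or starts with "//"), which is why the snippet above falls back to the path. A minimal sketch of an alternative, assuming it is acceptable to prepend a dummy scheme when one is missing; the helper name hostKey is hypothetical.

```php
<?php
// Hypothetical helper: returns a lowercase host without a leading "www.", or '' if none is found.
function hostKey(string $url): string
{
    // parse_url() needs a scheme to recognise the host, so add one when it is missing.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return ''; // malformed input: no usable host
    }
    return preg_replace('/^www\./', '', strtolower($host));
}

$unique = [];
foreach (['https://www.google.com', 'www.google.com', 'google.com/search', 'testing.com:9200'] as $url) {
    $key = hostKey($url);
    if ($key !== '') {
        $unique[$key] = true; // array keys act as a set
    }
}
print_r(array_keys($unique));
```

Using the hosts as array keys also replaces the in_array scan with a key lookup.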

Why do you normalize your result array every time?
Here is a better solution based on your code. It keys the result by the normalized URL, so the duplicate check is a single isset:
public static function array_unique_url(array $array): array
{
    $uniqueArray = [];
    foreach ($array as $item) {
        $normalized = self::normalizeUrl($item);
        if (!isset($uniqueArray[$normalized])) {
            $uniqueArray[$normalized] = $item; // keep the first original URL per normalized form
        }
    }
    return $uniqueArray;
}
public static function normalizeUrl(string $url)
{
    return preg_replace('#^(https?://)?(www\.)?#', '', strtolower($url));
}
When you want your original items, use array_values(array_unique_url($array)).
For your normalized URLs, use array_keys(array_unique_url($array)).

Try this simple solution. It uses two functions, preg_replace and parse_url, to produce the desired output.
<?php
$urls = array(
"http://google.com",
"https://google.com",
"http://www.google.com",
"https://www.google.com",
"www.google.com",
"google.com"
);
$uniqueUrls=array();
foreach($urls as $url)
{
$changedUrl = preg_replace("/^(https?:\/\/)?/i", "http://", $url); // force an http:// prefix so parse_url can find the host
$domain = preg_replace("/^www\./", "", strtolower((string) parse_url($changedUrl, PHP_URL_HOST))); // extract the host and strip a leading www.
$uniqueUrls[$domain] = $domain; // key by the full host so that e.g. google.com and google.org stay distinct
}
print_r(array_values($uniqueUrls));

Related

PHP get possible string combination of given array which match with given string

I have an array containing a bunch of strings, and I would like to find all possible combinations of them, in any order, that concatenate to a given string/word.
$dictionary = ['flow', 'stack', 'stackover', 'over', 'code'];
input: stackoverflow
output:
#1 -> ['stack', 'over', 'flow']
#2 -> ['stackover', 'flow']
What I've tried: first exclude the array elements that do not occur in the input string, then try to match every merged pair of elements against it. I'm not sure about this approach and I'm stuck. Can anyone help me figure this out? Thank you in advance. Here is my code so far:
<?php
$dict = ['flow', 'stack', 'stackover', 'over', 'code'];
$word = 'stackoverflow';
$dictHas = [];
foreach ($dict as $w) {
if (strpos($word, $w) !== false) {
$dictHas[] = $w;
}
}
$result = [];
foreach ($dictHas as $el) {
foreach ($dictHas as $wo) {
$merge = $el . $wo;
if ($merge == $word) {
} elseif (strpos($word, $merge) !== false) {
}
}
}
print_r($result);
For problems like this you want to use backtracking:
function splitString($string, $dict)
{
$result = [];
//if the string is already empty return empty array
//(compare against '' because empty("0") is also true)
if ($string === '') {
return $result;
}
foreach ($dict as $idx => $term) {
if (strpos($string, $term) === 0) {
//if the term is at the start of string
//get the rest of string
$substr = substr($string, strlen($term));
//if all of string has been processed, record the current term
//and keep looking so that earlier results are not discarded
if ((string) $substr === '') {
$result[] = [$term];
continue;
}
//get the dictionary without used term
$subDict = $dict;
unset($subDict[$idx]);
//get results of splitting the rest of string
$sub = splitString($substr, $subDict);
//merge them with current term
if (!empty($sub)) {
foreach ($sub as $subResult) {
$result[] = array_merge([$term], $subResult);
}
}
}
}
return $result;
}
$input = "stackoverflow";
$dict = ['flow', 'stack', 'stackover', 'over', 'code'];
$output = splitString($input, $dict);

PHP url array search and return closest url

I have the following array with urls
$data = Array ( 'http://localhost/my_system/users',
'http://localhost/my_system/users/add_user',
'http://localhost/my_system/users/groups',
'http://localhost/my_system/users/add_group' );
Then I have a variable
$url = 'http://localhost/my_system/users/by_letter/s';
I need a function that will return the closest url from the array if $url does not exist. Something like
function get_closest_url($url,$data){
}
get_closest_url($url,$data); //returns 'http://localhost/my_system/users/'
$url2 = 'http://localhost/my_system/users/groups/ungrouped';
get_closest_url($url2,$data); //returns 'http://localhost/my_system/users/groups/'
$url3 = 'http://localhost/my_system/users/groups/add_group/x/y/z';
get_closest_url($url3,$data); //returns 'http://localhost/my_system/users/groups/add_group/'
You can explode both the current URL and each of the URLs in $data, intersect the arrays, then return the array with the most elements (best match). If there are no matches, return false:
<?php
$data = [ "localhost/my_system/users",
"localhost/my_system/users/add_user",
"localhost/my_system/users/by_letter/groups",
"localhost/my_system/users/add_group"];
$url = "localhost/my_system/users/by_letter/s";
function getClosestURL($url, $data) {
$matches = [];
$explodedURL = explode("/", $url);
foreach ($data as $match) {
$explodedMatch = explode("/", $match);
$matches[] = array_intersect($explodedMatch, $explodedURL);
}
$bestMatch = max($matches);
return count($bestMatch) > 0 ? implode("/", $bestMatch) : false; // only return the path if there are matches, otherwise false
}
var_dump(getClosestURL($url, $data)); //returns localhost/my_system/users/by_letter
var_dump(getClosestURL("local/no/match", $data)); //returns false
You don't mention how you want to specifically check if the URL exists. If it needs to be "live", you can use get_headers() and check the first item for the HTTP status. If it's not 200, you can then go ahead with the URL intersection.
$headers = get_headers($url);
$httpStatus = substr($headers[0], 9, 3);
if ($httpStatus === "200") {
return $url; // $url is OK
}
// else, keep going with the previous function
function get_closest_url($item,$possibilities){
$result = [];
foreach($possibilities as $possibility){
$lev = levenshtein($possibility, $item);
if($lev === 0){
#### we have got an exact match
return $possibility;
}
#### if two possibilities have the same lev we return only one
$result[$lev] = $possibility;
}
#### return the closest match (lowest Levenshtein distance)
return $result[min(array_keys($result))];
}
That should do it.
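Neither approach above strictly guarantees that the result is a parent path of $url, which is what the question's examples suggest. A minimal sketch of a longest-prefix match, assuming "closest" means the longest entry in $data that is a path prefix of the URL; the function name closestParentUrl is hypothetical.

```php
<?php
// Hypothetical helper: returns the longest candidate that is a path prefix of $url, or null.
function closestParentUrl(string $url, array $candidates): ?string
{
    $url = rtrim($url, '/');
    $best = null;
    foreach ($candidates as $candidate) {
        $prefix = rtrim($candidate, '/');
        // Accept an exact match, or a prefix that ends on a "/" boundary
        // (so ".../users" does not match ".../users_old").
        $isPrefix = $url === $prefix || strpos($url, $prefix . '/') === 0;
        if ($isPrefix && ($best === null || strlen($prefix) > strlen(rtrim($best, '/')))) {
            $best = $candidate;
        }
    }
    return $best;
}

$data = [
    'http://localhost/my_system/users',
    'http://localhost/my_system/users/add_user',
    'http://localhost/my_system/users/groups',
    'http://localhost/my_system/users/add_group',
];
$a = closestParentUrl('http://localhost/my_system/users/by_letter/s', $data);
$b = closestParentUrl('http://localhost/my_system/users/groups/ungrouped', $data);
$c = closestParentUrl('http://example.com/other', $data);
var_dump($a, $b, $c);
```

The function returns the candidate exactly as stored in $data, so no trailing slash is added.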

Unset array Items matching a pattern [duplicate]

I have the following Array :
Array
{
[0]=>"www.abc.com/directory/test";
[1]=>"www.abc.com/test";
[2]=>"www.abc.com/directory/test";
[3]=>"www.abc.com/test";
}
I only want the items that have something in the middle of the URL, like /directory/, and want to unset the items that do not.
Output should be like:
Array
{
[0]=>"www.abc.com/directory/test";
[1]=>"www.abc.com/directory/test";
}
An example without closures. Sometimes you just need to understand the basics first, before you can move on to the neater stuff.
$newArray = array();
foreach($array as $value) {
if ( strpos( $value, '/directory/') !== false ) {
$newArray[] = $value;
}
}
Try using array_filter like this:
$result = array_filter($data, function($el) {
$parts = parse_url($el);
return substr_count($parts['path'], '/') > 1;
});
If there is something between the host and the last segment, the path will always contain at least 2 slashes.
So for input data
$data = Array(
"http://www.abc.com/directory/test",
"www.abc.com/test",
"www.abc.com/directory/test",
"www.abc.com/test/123"
);
your output will be
Array
(
[0] => http://www.abc.com/directory/test
[2] => www.abc.com/directory/test
[3] => www.abc.com/test/123
)
A couple of approaches:
$urls = array(
'www.abc.com/directory/test',
'www.abc.com/test',
'www.abc.com/foo/directory/test',
'www.abc.com/foo/test',
);
$matches = array();
// if you want /directory/ to appear anywhere:
foreach ($urls as $url) {
if (strpos($url, '/directory/') !== false) {
$matches[] = $url;
}
}
var_dump($matches);
$matches = array();
// if you want /directory/ to be the first path:
foreach ($urls as $url) {
// make the strings valid URLs
if (0 !== strpos($url, 'http://')) {
$url = 'http://' . $url;
}
$parts = parse_url($url);
if (isset($parts['path']) && substr($parts['path'], 0, 11) === '/directory/') {
$matches[] = $url;
}
}
var_dump($matches);
<?php
$array = Array("www.abc.com/directory/test",
"www.abc.com/test",
"www.abc.com/directory/test",
"www.abc.com/test",
);
var_dump($array);
array_walk($array, function($val,$key) use(&$array){
if (strpos($val, 'directory') === false) {
unset($array[$key]);
}
});
var_dump($array);
This requires PHP >= 5.3.0 (anonymous functions).
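Note that strpos() returns an integer offset or false, so an explicit comparison against false is the safe test. A short sketch combining that with array_filter, plus array_values to get the consecutive keys shown in the question's expected output; the variable name $kept is only illustrative.

```php
<?php
$urls = [
    'www.abc.com/directory/test',
    'www.abc.com/test',
    'www.abc.com/directory/test',
    'www.abc.com/test',
];

// Keep only entries containing "/directory/"; compare with false explicitly
// because a match at offset 0 would otherwise be treated as "not found".
$kept = array_values(array_filter($urls, function ($url) {
    return strpos($url, '/directory/') !== false;
}));

print_r($kept);
```

array_filter preserves the original keys (0 and 2 here), which is why array_values is applied afterwards.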

an array of parameter values

function test()
{
$content = "lang=en]text en|lang=sp]text sp";
$atts = explode('|', $content);
}
What I'm trying to do is be able to echo $param['en'] to get "text en" and $param['sp'] to get "text sp". Is that possible?
The $content actually comes from a database record.
$param = array();
$langs = explode('|', $content);
foreach ($langs as $lang) {
$arr = explode(']', $lang);
$key = substr($arr[0], 5);
$param[$key] = $arr[1];
}
This is if you are sure $content is well-formatted. Otherwise you will need to put in additional checks to make sure $langs and $arr are what they should be. Use the following to quickly check what's inside an array:
echo '<pre>'.print_r($array_to_be_inspected, true).'</pre>';
Hope this helps
If $content is not a hard-coded string:
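function test()
{
    $content = "lang=en]text en|lang=sp]text sp";
    $params = array();
    $atts = explode('|', $content);
    foreach ($atts as $att) {
        $tempLang = explode("]", $att);
        // array_pop() takes its argument by reference, so it needs a real variable
        $keyParts = explode("=", $tempLang[0]);
        $params[array_pop($keyParts)] = $tempLang[1];
    }
    var_dump($params);
}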
I think in this case you could use regular expressions.
$atts = explode('|', $content);
foreach ($atts as $subtext) {
if (preg_match('/^lang=(\w+)\](.*)$/', $subtext, $regs)) {
$param[$regs[1]] = $regs[2]; // $regs[0] is the whole match; the groups start at index 1
}
}
That said, if the value comes from a database, it suggests a poor database structure. If you can edit it, try to normalize the database.
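The same parsing can also be done in one pass with preg_match_all. This is only a sketch, assuming the text part never contains a "|" character; the function name parseLangParams is hypothetical.

```php
<?php
// Hypothetical helper: turns "lang=en]text en|lang=sp]text sp" into ['en' => 'text en', 'sp' => 'text sp'].
function parseLangParams(string $content): array
{
    $params = [];
    // Each segment looks like "lang=<code>]<text>"; segments are separated by "|".
    preg_match_all('/lang=(\w+)\]([^|]*)/', $content, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $params[$m[1]] = $m[2]; // $m[1] is the language code, $m[2] the text
    }
    return $params;
}

$param = parseLangParams('lang=en]text en|lang=sp]text sp');
echo $param['en'], "\n";
echo $param['sp'], "\n";
```

Segments that do not match the pattern are silently skipped, so malformed input yields fewer entries instead of notices.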

php -> delete items from array which contain words from a blacklist

I have an array with several Twitter tweets and want to delete all tweets in this array that contain one of the following words: blacklist|blackwords|somemore
Who could help me with this case?
Here's a suggestion:
<?php
$banned_words = 'blacklist|blackwords|somemore';
$tweets = array( 'A normal tweet', 'This tweet uses blackwords' );
$blacklist = explode( '|', $banned_words );
// Check each tweet
foreach ( $tweets as $key => $text )
{
// Search the tweet for each banned word
foreach ( $blacklist as $badword )
{
if ( stristr( $text, $badword ) )
{
// Remove the offending tweet from the array
unset( $tweets[$key] );
break; // no need to test the remaining words for this tweet
}
}
}
?>
You can use array_filter() function:
$badwords = ... // initialize badwords array here
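function filter($text)
{
    global $badwords;
    foreach ($badwords as $word) {
        if (strpos($text, $word) !== false) {
            return false; // contains a bad word: drop the tweet
        }
    }
    return true; // no bad word found: keep the tweet
}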
$result = array_filter($tweetsArray, "filter");
Use array_filter.
Check this sample:
$tweets = array();
function safe($tweet) {
$badwords = array('foo', 'bar');
foreach ($badwords as $word) {
if (strpos($tweet, $word) !== false) {
// Baaaad
return false;
}
}
// OK
return true;
}
$safe_tweets = array_filter($tweets, 'safe');
You can do it in a lot of ways, so without more information I can only give this starting point:
$a = Array(" fafsblacklist hello hello", "white goodbye", "howdy?!!");
$clean = Array();
$blacklist = '/(blacklist|blackwords|somemore)/';
foreach($a as $i) {
if(!preg_match($blacklist, $i)) {
$clean[] = $i;
}
}
var_dump($clean);
Using regular expressions:
preg_grep("/blacklist|blackwords|somemore/", $array, PREG_GREP_INVERT)
But I warn you that this may be inefficient, and you must escape any punctuation characters in the blacklist (for example with preg_quote).
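To illustrate the point about punctuation, here is a small sketch that builds the pattern from a word list with preg_quote. The sample words and tweets are made up, and the /i flag (case-insensitive matching) is an assumption about the desired behaviour.

```php
<?php
// Illustrative data only; "c++" is included to show why escaping matters.
$blacklist = ['blacklist', 'blackwords', 'c++'];
$tweets = ['A normal tweet', 'This tweet uses BlackWords', 'I love C++', 'cc is fine'];

// Escape each word so characters such as "+" or "." are matched literally.
$quoted = array_map(function ($word) {
    return preg_quote($word, '/');
}, $blacklist);

$pattern = '/' . implode('|', $quoted) . '/i';

// PREG_GREP_INVERT returns the entries that do NOT match the pattern.
$clean = array_values(preg_grep($pattern, $tweets, PREG_GREP_INVERT));
print_r($clean);
```

Without preg_quote, "c++" would be an invalid pattern fragment, since "+" is a quantifier in a regular expression.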
