Searching keywords from one array against values of another array

I have 2 arrays. One with bad keywords and the other with names of sites.
$bad_keywords = array('google',
'twitter',
'facebook');
$sites = array('youtube.com', 'google.com', 'm.google.co.uk', 'walmart.com', 'thezoo.com', 'etc.com');
Simple task: I need to go through the $sites array and filter out any value that contains any keyword found in the $bad_keywords array. At the end I need an array of clean values in which none of the bad keywords occur.
I have scoured the web and can't seem to find a simple easy solution for this. Here are several methods that I have tried:
1. using 2 foreach loops (feels slower - I think using built-in PHP functions will speed it up)
2. array_walk
3. array_filter
But I haven't managed to nail down the best, most efficient way. I want to have a tool that will filter through a list of 20k+ sites against a list of keywords that may be up to 1k long, so performance is paramount. Also, what would be the better method for the actual search in this case - regex or strpos?
What other options are there to do this and what would be the best way?

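Short solution using the preg_grep function. Each keyword is passed through preg_quote so that special characters such as "." are matched literally, and the PREG_GREP_INVERT flag returns the entries that do not match:
$pattern = '/' . implode('|', array_map(function ($k) { return preg_quote($k, '/'); }, $bad_keywords)) . '/';
$result = preg_grep($pattern, $sites, PREG_GREP_INVERT);
print_r($result);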
The output:
Array
(
[0] => youtube.com
[3] => walmart.com
[4] => thezoo.com
[5] => etc.com
)
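On the regex vs. strpos part of the question, a substring-based variant might look like the sketch below. The helper name filterSites is hypothetical, and the sketch assumes case-insensitive matching is wanted (hence stripos). It stops at the first keyword hit per site; which approach is faster for 20k sites and 1k keywords would need to be measured on real data.

```php
<?php
// Hypothetical helper: keep only the sites that contain none of the keywords.
function filterSites(array $sites, array $badKeywords): array
{
    $clean = array_filter($sites, function (string $site) use ($badKeywords): bool {
        foreach ($badKeywords as $keyword) {
            // stripos returns false when the keyword is absent (0 is a valid hit)
            if (stripos($site, $keyword) !== false) {
                return false; // first hit is enough, skip the remaining keywords
            }
        }
        return true;
    });

    // Re-index from 0; drop array_values() to keep the original keys instead
    return array_values($clean);
}

$bad_keywords = ['google', 'twitter', 'facebook'];
$sites = ['youtube.com', 'google.com', 'm.google.co.uk', 'walmart.com', 'thezoo.com', 'etc.com'];

$clean = filterSites($sites, $bad_keywords);
print_r($clean);
```

Unlike the preg_grep version, this one re-indexes the result, so the keys run from 0 upwards.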

Related

Loop through nested arrays in PHP

I have a very complex array that I need to loop through.
Array(
[1] => Array(
[1] => ""
[2] => Array(
[1] => ""
[2] => Array(
[1] => ""
)
)
)
)
I can't use nested loops because this array could contain hundreds of nested arrays. Also, the nested ones could contain nested arrays too.
This array presents comments and replies, Where replies could contain more replies.
Any thoughts?
You could use a \RecursiveArrayIterator, which is part of the PHP SPL and always ships with the PHP core.
<?php
$arr = [
    'lvl1-A' => [
        'lvl2' => [
            'lvl3' => 'done'
        ],
    ],
    'lvl1-B' => 'done',
];
// Walk one level of the iterator and recurse into any child arrays.
function traverse( \RecursiveIterator $it ): void {
    while ( $it->valid() ) {
        if ( $it->hasChildren() ) {
            print "{$it->key()} => \n";
            traverse( $it->getChildren() );
        } else {
            print "{$it->key()} => {$it->current()}\n";
        }
        $it->next();
    }
}
$it = new \RecursiveArrayIterator( $arr );
$it->rewind();
traverse( $it );
print 'Done.';
Run and play this example in the REPL here: https://3v4l.org/cGtoi
The code is deliberately verbose to show what you can expect to see: the iterator walks each level. How you actually code it is up to you. Keep in mind that filtering or flattening the array (read: transforming it up front) might be another option. You could also use a generator that emits each level, and perhaps go with cooperative multitasking/coroutines, as PHP core maintainer nikic explained in his blog post.
ProTip: Monitor your RAM consumption with the different variants in case your nested array really is large, is requested often, or has to deliver results fast.
In case you really need to be fast, consider streaming the result, so you can process the output while you are still working on the input array.
A last option might be to split the actual array into chunks (as you would when streaming) and process the smaller parts.
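The generator idea mentioned above could be sketched roughly as follows. The function name walkLeaves and the choice to yield a [depth, key, value] triple are assumptions made for illustration; the point is only that values are emitted one at a time, so the caller can process or stream them without building a flattened copy first.

```php
<?php
// Hypothetical helper: lazily yields [depth, key, value] for every non-array value.
function walkLeaves(array $items, int $depth = 0): \Generator
{
    foreach ($items as $key => $value) {
        if (is_array($value)) {
            // Delegate to the nested level; nothing is collected in memory
            yield from walkLeaves($value, $depth + 1);
        } else {
            yield [$depth, $key, $value];
        }
    }
}

$arr = [
    'lvl1-A' => [
        'lvl2' => [
            'lvl3' => 'done',
        ],
    ],
    'lvl1-B' => 'done',
];

$leaves = [];
foreach (walkLeaves($arr) as [$depth, $key, $value]) {
    $leaves[] = "$depth:$key=$value";
    print str_repeat('  ', $depth) . "$key => $value\n";
}
```

The recursion depth still equals the nesting depth of the data, so this helps with memory for wide arrays rather than with extremely deep ones.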
The case is quite complex: you have to loop, but for some reason you can't or don't want to:
... that I need to loop through
and
I can't use nested loops because this array could contain hundreds of nested arrays
This means you have to handle your data differently, for example by storing that huge amount of data and processing it later.
If for some reason that is not an option, you can consider:
splitting the big array into smaller arrays
checking how it works with json_encode and parsing the resulting string with str_* functions and regex
Your question leaves too many things open, e.g. what exactly these subarrays contain, whether you can ignore some parts of them, whether you can change the code that creates the huge array in the first place, etc.
On the other hand, assuming that you could loop, what would bother you? The memory usage, how long it will take, etc.?
You can always use cron to run it daily, but the most important thing is to find out why you ended up with a huge array in the first place.

How to split a search url into an associative array

So I would like to take a string like this,
q=Sugar Beet&qf=vegetables&range=time:[34-40]
and break it up into separate pieces that can be put into an associative array and sent to a Solr Server.
I want it to look like this
['q'] => ['Sugar Beet'],
['qf'] => ['vegetables']
After using urlencode I get
q%3DSugar+Beet%26qf%3Dvegetables%26range%3Dtime%3A%5B34-40%5D
Now I was thinking I would make two separate arrays that would use preg_split() and take the information between the & and the = sign or the = and the & sign, but this leaves the problem of the final and first because they do not start with an & or end in an &.
After this, the plan was to take the two array and combine them with array_combine().
So, how could I do a preg_split that addresses the problem of the first and final entry of the string? Is this way of doing it going to be too demanding on the server? Thank you for any help.
PS: I am using Drupal ApacheSolr to do this, which is why I need to split these up. They need to be sent to an object that is going to build q and qf differently for instance.
You don't need a regular expression to parse query strings. PHP already has a built-in function that does exactly this. Use parse_str():
$str = 'q=Sugar Beet&qf=vegetables&range=time:[34-40]';
parse_str($str, $params);
print_r($params);
Produces the output:
Array
(
[q] => Sugar Beet
[qf] => vegetables
[range] => time:[34-40]
)
You could also use the parse_url() function to pull the query string out of a full URL.
Also:
parse_str($_SERVER['QUERY_STRING'], $params);
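Combining the two functions might look like the sketch below. The URL is a made-up example; parse_url with PHP_URL_QUERY returns only the query part, and parse_str then URL-decodes it into an associative array.

```php
<?php
// Hypothetical URL, used only for illustration
$url = 'http://example.com/search?q=Sugar+Beet&qf=vegetables&range=time%3A%5B34-40%5D';

// Extract just the query string (null if the URL has none)
$query = parse_url($url, PHP_URL_QUERY);

// Decode it into key => value pairs
$params = [];
parse_str((string) $query, $params);

print_r($params);
```

Because parse_str decodes the values itself, the string should not be run through urlencode as a whole beforehand; that would also encode the = and & separators.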

Php regexp get the strings from array print_r like string

I'm trying to work out how to match strings that look like print_r-style array references.
variable_data[0][var_name]
From the above example I would like to get 3 strings: variable_data, 0 and var_name.
The above example is saved in a DB so that the same array structure can be recreated, but I'm stuck. Also, an if check should first test whether the string (as above) has that structure; otherwise no preg_match is needed.
Note: I don't want to serialize the array, since it 'may' contain some characters that might break it when unserializing, and I also need the value in the array to be fully visible.
Anyone with regexp skills who might know the approach?
Solution:
(\b([\w]*[\w]).\b([\w]*[\w]).+(\b[\w]*[\w]))
Those 2 first indexes should be skipped... but I still get what I want :)
Not for nothing but couldn't you just do..
$result = explode('[', $someString);
foreach ($result as $i => $v) {
    // Strip the closing bracket left over from each piece
    $temp = str_replace(']', '', $v);
    // Do something with $temp
}
Obviously you need to edit the above a little bit depending on what you are doing but it is very simple and even gives you the same flexibility and you don't need to invoke the matching engine...
I don't think we build regexes here for people... instead please see http://regexpal.com/ for a regex tester/builder with visual aid.
Furthermore, people usually don't know how to use them properly, which is then fostered by others creating the expressions for them.
Please remember that complex expressions can have terrible performance overheads, although there is nothing seemingly complex about your request...
Then, after it is complete, post your completed regex and answer your own question for maximum 1337ne$$ :)
But since I am nice here is your reward:
\[.+\]\[\d+\]
or
[a-z]+_[a-z]+\[.+\]\[\d+\]
Depending on what you want to match out of the string (which you didn't specify) so I assumed all
Both perform as follows:
arr_var[name][0]; //Matched
arr_var[name]; //Not matched
arr_var[name][0][1];//Matched
arr_var[name][2220][11];//Matched
Again, test them and understand with visual aid at the above link.
Solution:
(\b([\w]*[\w]).\b([\w]*[\w]).+(\b[\w]*[\w]))
Those 2 first indexes should be skipped... but i still get what i want :)
Edit
Here is improved one:
$str = "variable[group1][parent][child][grandchild]";
// \w+ matches each run of word characters; the brackets act as separators
preg_match_all('/\w+/', $str, $matches);
echo '<pre>';
print_r($matches[0]);
echo '</pre>';
// Output
Array
(
[0] => variable
[1] => group1
[2] => parent
[3] => child
[4] => grandchild
)
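The "if" check asked for in the question could be combined with the extraction roughly like this. The function name parseArrayPath is hypothetical, and the sketch assumes that names and indexes consist only of word characters (letters, digits, underscore).

```php
<?php
// Hypothetical helper: returns the parts of a string such as
// "variable_data[0][var_name]", or null if the string does not have that shape.
function parseArrayPath(string $path): ?array
{
    // A name followed by one or more [index] groups, and nothing else
    if (!preg_match('/^\w+(?:\[\w+\])+$/', $path)) {
        return null;
    }

    // Every run of word characters is one part; the brackets are separators
    preg_match_all('/\w+/', $path, $matches);

    return $matches[0];
}

print_r(parseArrayPath('variable_data[0][var_name]'));
var_dump(parseArrayPath('plain_string'));
```

All parts come back as strings, so a numeric index such as 0 is returned as '0'.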

Eliminating commonly used words from a string in PHP, MySQL

I have code that takes a massive string from a SQL database, parses it into individual words and puts them into an array to be counted, with the goal of making a graph of the most used words, but I need to find a means of removing commonly used words. I made a very basic array of words to compare against, but it's not very effective. Is there some kind of dictionary file I can compare it to? Any ideas would be fantastic.
I am currently editing an existing "Data representation algorithm" at an internship and I really don't know where to start. It has been suggested that I use a dictionary file, but not only do I not have one, I wouldn't know how to compare against it.
You can do this using the in_array function:
<?php
$whitelist = array('a', 'the');
function whitelisted($var)
{
global $whitelist;
return (!in_array($var, $whitelist));
}
$str = "a lazy fox jumped over the lazy farmer";
print_r(array_count_values(array_filter(explode(" ", $str), "whitelisted")));
?>
//produces:
Array
(
[lazy] => 2
[fox] => 1
[jumped] => 1
[over] => 1
[farmer] => 1
)
Of course, you could and should re-arrange this to work with your own scope (global is probably not ideal), but it should get you started on pruning out common words you don't care to count.
http://ideone.com/kfNzM
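One way to avoid the global, sketched here under a few assumptions: the function name countWords is hypothetical, the stop-word list is passed in as a parameter, and words are lower-cased before comparison. Flipping the list into array keys turns each lookup into an isset check instead of an in_array scan, which may matter when the list grows to a dictionary-sized file.

```php
<?php
// Hypothetical helper: count words in $text, ignoring anything in $stopWords.
function countWords(string $text, array $stopWords): array
{
    // Use the stop words as array keys so each lookup is a cheap isset()
    $stop = array_flip(array_map('strtolower', $stopWords));

    // Format 1 returns the words as a list; punctuation is dropped
    $words = str_word_count(strtolower($text), 1);

    $kept = array_filter($words, function (string $word) use ($stop): bool {
        return !isset($stop[$word]);
    });

    $counts = array_count_values($kept);
    arsort($counts); // most frequent first

    return $counts;
}

$counts = countWords('a lazy fox jumped over the lazy farmer', ['a', 'the']);
print_r($counts);
```

A stop-word file with one word per line could be loaded with file('stopwords.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES), where the file name is again only an example.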

Find 3-8 word common phrases in body of text using PHP

I'm looking for a way to find common phrases within a body of text using PHP. If it's not possible in php, I'd be interested in other web languages that would help me complete this.
Memory and speed are not issues.
Right now, I'm able to easily find keywords, but don't know how to go about searching phrases.
I've written a PHP script that does just that, right here. It first splits the source text into an array of words and their occurrence count. Then it counts common sequences of those words with the specified parameters. It's old code and not commented, but maybe you'll find it useful.
Using just PHP? The most straightforward I can come up with is:
Add each phrase to an array
Get the first phrase from the array and remove it
Find the number of phrases that match it and remove those, keeping a count of matches
Push the phrase and the number of matches to a new array
Repeat until initial array is empty
I'm trash for formal CS, but I believe this is of n^2 complexity, specifically involving n(n-1)/2 comparisons in the worst case. I have no doubt there is some better way to do this, but you mentioned that efficiency is a non-issue, so this'll do.
Code follows (I used a new function to me, array_keys that accepts a search parameter):
// assign the source text to $text
$text = file_get_contents('mytext.txt');
// there are other ways to do this, like preg_match_all,
// but this is computationally the simplest
$phrases = explode('.', $text);
// filter the phrases
// if you're in PHP5, you can use a foreach loop here
$num_phrases = count($phrases);
for($i = 0; $i < $num_phrases; $i++) {
$phrases[$i] = trim($phrases[$i]);
}
$counts = array();
while(count($phrases) > 0) {
$p = array_shift($phrases);
$keys = array_keys($phrases, $p);
$c = count($keys);
$counts[$p] = $c + 1;
if($c > 0) {
foreach($keys as $key) {
unset($phrases[$key]);
}
}
}
print_r($counts);
View it in action: http://ideone.com/htDSC
I think you should go for
str_word_count
$str = "Hello friend, you're
looking good today!";
print_r(str_word_count($str, 1));
will give
Array
(
[0] => Hello
[1] => friend
[2] => you're
[3] => looking
[4] => good
[5] => today
)
Then you can use array_count_values()
$array = array(1, "hello", 1, "world", "hello");
print_r(array_count_values($array));
which will give you
Array
(
[1] => 2
[hello] => 2
[world] => 1
)
An ugly solution, since you said ugly is ok, would be to search for the first word for any of your phrases. Then, once that word is found, check if the next word past it matches the next expected word in the phrase. This would be a loop that would keep going so long as the hits are positive until either a word is not present or the phrase is completed.
Simple, but exceedingly ugly and probably very, very slow.
Coming in late here, but since I stumbled upon this while looking to do a similar thing, I thought I'd share where I landed in 2019:
https://packagist.org/packages/yooper/php-text-analysis
This library made my task downright trivial. In my case, I had an array of search phrases that I wound up breaking up into single terms, normalizing, then creating two and three-word ngrams. Looping through the resulting ngrams, I was able to easily summarize the frequency of specific phrases.
$words = tokenize($searchPhraseText);
$words = normalize_tokens($words);
$ngram2 = array_unique(ngrams($words, 2));
$ngram3 = array_unique(ngrams($words, 3));
Really cool library with a lot to offer.
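The same n-gram idea can be sketched in plain PHP without the library. The function name countPhrases and its defaults are assumptions chosen to match the 3-8 word range in the question. Note that this simple version ignores punctuation, so phrases may span sentence boundaries.

```php
<?php
// Hypothetical helper: count every run of $minWords..$maxWords consecutive
// words and return the runs that occur more than once.
function countPhrases(string $text, int $minWords = 3, int $maxWords = 8): array
{
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);
    $counts = [];

    for ($n = $minWords; $n <= $maxWords; $n++) {
        // Slide a window of $n words across the text
        for ($i = 0; $i + $n <= $total; $i++) {
            $phrase = implode(' ', array_slice($words, $i, $n));
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    // A phrase is "common" here if it appears at least twice
    $counts = array_filter($counts, function (int $count): bool {
        return $count > 1;
    });
    arsort($counts);

    return $counts;
}

$text = 'The quick brown fox jumps. The quick brown fox sleeps.';
$common = countPhrases($text);
print_r($common);
```

Every window is stored before filtering, so memory grows with the length of the text times the number of phrase lengths; the question states that memory and speed are not a concern.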
If you want fulltext search in html files, use Sphinx - powerful search server.
Documentation is here
