Eliminating commonly used words from a string in PHP, MySQL - php

I have code that takes a massive string from a SQL database, parses it into individual words, and puts them into an array to be counted, with the goal of making a graph of the most used words. I need a way to remove commonly used words first. I made a very basic array of words to compare against, but it's not very effective. Is there some kind of dictionary file I can compare it to? Any ideas would be fantastic.
I am currently editing an existing "Data representation algorithm" at an internship and I really don't know where to start. It has been suggested I use a dictionary file, but not only do I not have one, I wouldn't know how to compare against it.

You can do this using the in_array function:
<?php
$whitelist = array('a', 'the');
function whitelisted($var)
{
global $whitelist;
return (!in_array($var, $whitelist));
}
$str = "a lazy fox jumped over the lazy farmer";
print_r(array_count_values(array_filter(explode(" ", $str), "whitelisted")));
?>
//produces:
Array
(
[lazy] => 2
[fox] => 1
[jumped] => 1
[over] => 1
[farmer] => 1
)
Of course, you could and should re-arrange this to work with your own scope (global is probably not ideal), but it should get you started on pruning out common words you don't care to count.
http://ideone.com/kfNzM
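A minimal sketch of the same idea without global, assuming a hypothetical stop-word list (the words and the stopwords.txt file name in the comment are illustrative, not from the original answer). The function name countWords is also made up for this example. Flipping the list into keys lets each word be checked with isset rather than a linear in_array scan:

```php
<?php
// Hypothetical stop-word list. In practice it could be loaded from a text
// file with one word per line, for example:
// $stopWords = file('stopwords.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$stopWords = ['a', 'the', 'over'];

function countWords(string $text, array $stopWords): array
{
    // Use the stop words as array keys so each lookup is a single isset().
    $lookup = array_flip($stopWords);

    // Lowercase the text and split on runs of non-word characters.
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Keep only the words that are not in the stop-word list.
    $kept = array_filter($words, function ($word) use ($lookup) {
        return !isset($lookup[$word]);
    });

    // Count occurrences and sort by count, highest first.
    $counts = array_count_values($kept);
    arsort($counts);
    return $counts;
}

$counts = countWords("A lazy fox jumped over the lazy farmer", $stopWords);
print_r($counts);
```

Note that splitting on \W+ also breaks contractions such as "don't" into two pieces, so the split pattern may need adjusting for real text.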

Related

Is there a possible way to use multidimensional arrays to replace words in a string using str_replace() and the array() function?

Note: I have seen this question here, but my array is completely different.
Hello everyone. I have been trying to make a censoring program for my website. What I ended up with was:
$wordsList = [
array("frog","sock"),
array("Nock","crock"),
];
$message = str_replace($wordsList[0], $wordsList[1], "frog frog Nock Nock");
echo $message;
What I am trying to do is replace "frog" with "sock" using multidimensional arrays without typing all of the words out in str_replace();
Expected Output: "sock sock crock crock"
However, when I execute it, for some unknown reason it doesn't actually replace the words, without any errors. I think it's a rookie mistake that I made, but I have searched and have not found any documentation on using a system like this. Please help!
You need to change the structure of your wordsList array.
There are two structures that will make it easy:
As key/value pairs
This would be my recommendation since it's super clear what the strings and their replacements are.
// Store them as key/value pairs with the search and replacement strings
$wordsList = [
'frog' => 'sock',
'Nock' => 'crock',
];
$message = str_replace(
array_keys($wordsList), // Get all keys as the search array
$wordsList, // The replacements
"frog frog Nock Nock"
);
Demo: https://3v4l.org/WsUdn
As a multidimensional array
This one requires you to add the search/replacement values in the same order, which can be hard to read when you have a few different strings.
$wordsList = [
['frog', 'Nock'], // All search strings
['sock', 'crock'], // All replacements
];
$message = str_replace(
$wordsList[0], // All search strings
$wordsList[1], // The replacement strings
"frog frog Nock Nock"
);
Demo: https://3v4l.org/RQjC6
If you can't change the original array, then create a new array with the correct structure since that won't work "as is".
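If the original array of [search, replacement] pairs has to stay as it is, one possible way to build the key/value structure from it is array_column. This is only a sketch; the variable name $map is an assumption made for illustration:

```php
<?php
// The original structure from the question: one [search, replacement] pair per row.
$wordsList = [
    array("frog", "sock"),
    array("Nock", "crock"),
];

// Take column 1 as the values and column 0 as the keys:
// ['frog' => 'sock', 'Nock' => 'crock']
$map = array_column($wordsList, 1, 0);

$message = str_replace(
    array_keys($map),   // All search strings
    array_values($map), // The replacement strings
    "frog frog Nock Nock"
);
echo $message; // sock sock crock crock
```

str_replace applies the pairs one after another, so a replacement that happens to contain a later search string would be replaced again. strtr($text, $map) avoids that if it becomes a problem.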

Searching keywords from one array against values of another array - php

I have 2 arrays. One with bad keywords and the other with names of sites.
$bad_keywords = array('google',
'twitter',
'facebook');
$sites = array('youtube.com', 'google.com', 'm.google.co.uk', 'walmart.com', 'thezoo.com', 'etc.com');
Simple task: I need to filter through the $sites array and filter out any value that contains any keyword that is found in the $bad_keywords array. At the end of it I need an array with clean values that I would not find any bad_keywords occurring at all.
I have scoured the web and can't seem to find a simple easy solution for this. Here are several methods that I have tried:
1. using 2 foreach loops (feels slower - I think using in-built php functions will speed it up)
2. array_walk
3. array_filter
But I haven't managed to nail down the best, most efficient way. I want to have a tool that will filter through a list of 20k+ sites against a list of keywords that may be up to 1k long, so performance is paramount. Also, what would be the better method for the actual search in this case - regex or strpos?
What other options are there to do this and what would be the best way?
Short solution using preg_grep function:
// PREG_GREP_INVERT (value 1) returns the entries that do NOT match the pattern
$result = preg_grep('/'. implode('|', $bad_keywords) .'/', $sites, PREG_GREP_INVERT);
print_r($result);
The output:
Array
(
[0] => youtube.com
[3] => walmart.com
[4] => thezoo.com
[5] => etc.com
)
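Since the question also asks about strpos, here is a sketch of the non-regex alternative using array_filter with a case-insensitive substring check. Which approach is faster for 20k sites and 1k keywords is not something this example settles, so it would be worth benchmarking both on real data:

```php
<?php
$bad_keywords = array('google', 'twitter', 'facebook');
$sites = array('youtube.com', 'google.com', 'm.google.co.uk', 'walmart.com', 'thezoo.com', 'etc.com');

// Keep a site only if none of the bad keywords occurs anywhere in it.
$clean = array_filter($sites, function ($site) use ($bad_keywords) {
    foreach ($bad_keywords as $keyword) {
        if (stripos($site, $keyword) !== false) {
            return false; // stop at the first keyword that matches
        }
    }
    return true;
});

print_r($clean);
```

Like preg_grep, array_filter preserves the original keys; wrap the result in array_values() if a re-indexed list is needed.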

Php regexp get the strings from array print_r like string

I'm trying to work out how to match strings that look like print_r-style array references, for example:
variable_data[0][var_name]
From the above example I would like to get 3 strings: variable_data, 0 and var_name.
The above example is saved in the DB so that the same array structure can be recreated, but I'm stuck. There should also be an if check that tests whether the string (as above) has that structure; otherwise no preg_match is needed.
Note: I don't want to serialize the array, since it 'may' contain some characters that might break it when unserializing, and I also need the value in the array to be fully visible.
Anyone with regexp skills who might know the approach?
Solution:
(\b([\w]*[\w]).\b([\w]*[\w]).+(\b[\w]*[\w]))
Those 2 first indexes should be skipped... but I still get what I want :)
Not for nothing but couldn't you just do..
$result = explode('[', $someString);
foreach ($result as $i => $v) {
// Strip the closing bracket left over from each piece
$temp = str_replace(']', '', $v);
//Do something with temp
}
Obviously you need to edit the above a little bit depending on what you are doing but it is very simple and even gives you the same flexibility and you don't need to invoke the matching engine...
I don't think we build regexes here for people... instead please see http://regexpal.com/ for a regex tester / builder with visual aid.
Furthermore, people usually don't know how to use them properly, which is then fostered by others creating the expressions for them.
Please remember complex expressions can have terrible performance overheads although there is nothing seemingly complex about your request...
Then, after it is complete, post your completed regex and answer your own question for maximum 1337ne$$ :)
But since I am nice here is your reward:
\[.+\]\[\d+\]
or
[a-z]+_[a-z]+\[.+\]\[\d+\]
Which one to use depends on what you want to match out of the string (which you didn't specify), so I assumed all of it.
Both perform as follows:
arr_var[name][0]; //Matched
arr_var[name]; //Not matched
arr_var[name][0][1];//Matched
arr_var[name][2220][11];//Matched
Again, test them and understand with visual aid at the above link.
Solution:
(\b([\w]*[\w]).\b([\w]*[\w]).+(\b[\w]*[\w]))
Those 2 first indexes should be skipped... but I still get what I want :)
Edit
Here is improved one:
$str = "variable[group1][parent][child][grandchild]";
preg_match_all('/(\b([\w]*[\w]))/', $str,$matches);
echo '<pre>';
print_r($matches[0]); // index 0 holds the full matches; 1 and 2 repeat them per capture group
echo '</pre>';
// Output
Array
(
[0] => variable
[1] => group1
[2] => parent
[3] => child
[4] => grandchild
)
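To go one step further and recreate the array structure from such a string, as the question asks, the extracted keys can be walked with a reference. This is a sketch under assumptions: the function name pathToArray and the value 'example' are made up for illustration, and the simpler pattern \w+ is used because it finds the same runs of word characters:

```php
<?php
// Build a nested array from a path such as "variable_data[0][var_name]".
function pathToArray(string $path, $value): array
{
    // Every run of word characters is one key in the path.
    preg_match_all('/\w+/', $path, $matches);

    $result = [];
    $ref = &$result;
    foreach ($matches[0] as $key) {
        $ref[$key] = [];     // create the next level
        $ref = &$ref[$key];  // and descend into it
    }
    $ref = $value;           // assign the value at the deepest level
    unset($ref);             // drop the reference before returning

    return $result;
}

$nested = pathToArray('variable_data[0][var_name]', 'example');
print_r($nested);
```

PHP converts the numeric string key "0" to the integer key 0, so the result is the same as writing $nested['variable_data'][0]['var_name'] = 'example'.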

PHP - Break up word into array

I've done plenty of googling and whatnot and can't find quite what I'm looking for...
I am working on tightening up the authentication for my website. I decided to take the user's credentials, and hash/salt the heck out of them. Then store these values in the DB and in user cookies. I modified a script I found on the PHP website and it's working great so far. I noticed however when using array_rand, that it would select the chars from the predefined string, sequentially. I didn't like that, so I decided to use a shuffle on the array_rand'd array. Worked great.
Next! I thought it would be clever to turn my user-inputted password into an array, then merge that with my salted array! Well, I am having trouble turning my user's password into an array. I want each character in their password to be an array entry. I.e., if your password was "cool", the array would be Array 0 => c 1 => o 2 => o 3 => l, etc. I have tried wordwrap to split up the string and then explode on the specified break character, but that didn't work. I figure I could do something with a for loop, strlen and whatnot, but there HAS to be a more elegant way.
Any suggestions? I'm kind of stumped :( Here is what I have so far. I'm not done with this, as I haven't progressed further than the explodey part.
$strings = wordwrap($string, 1, "|");
echo $strings . "<br />";
$stringe = explode("|", $strings, 1);
print_r($stringe);
echo "<br />";
echo "This is the exploded password string, for mixing with salt.<hr />";
Thank you so much :)
The PHP function you want is str_split:
str_split('cool', 1);
Used as above, it would return:
[0] => c
[1] => o
[2] => o
[3] => l
PHP also lets you index a string like an array to read individual characters, and it will happily do what you would expect. For example:
$string = 'cool';
echo $string[1]; // outputs 'o'
Never, EVER implement (or in this case, design too!) cryptographic algorithms unless you really know what you are doing. If you decide to go ahead and do it anyway, you're putting your website at risk.
There's no reason you should have to do this: there are most certainly libraries and/or functions that do all of this sort of thing already.
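As one example of such built-in functions, PHP 5.5 and later ship password_hash and password_verify, which generate the salt and store it inside the hash string. A minimal sketch, using the example password from the question:

```php
<?php
$password = 'cool'; // example input from the question

// password_hash generates a random salt itself and embeds it in the result,
// so there is no separate salt to build, shuffle, or store.
$hash = password_hash($password, PASSWORD_DEFAULT);

// Store $hash in the database. At login, check the submitted password against it.
var_dump(password_verify($password, $hash)); // bool(true)
var_dump(password_verify('wrong', $hash));   // bool(false)
```

The hash string changes on every call because of the random salt, so always compare with password_verify rather than re-hashing and comparing strings.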

Find 3-8 word common phrases in body of text using PHP

I'm looking for a way to find common phrases within a body of text using PHP. If it's not possible in php, I'd be interested in other web languages that would help me complete this.
Memory and speed are not issues.
Right now, I'm able to easily find keywords, but don't know how to go about searching phrases.
I've written a PHP script that does just that, right here. It first splits the source text into an array of words and their occurrence count. Then it counts common sequences of those words with the specified parameters. It's old code and not commented, but maybe you'll find it useful.
Using just PHP? The most straightforward I can come up with is:
1. Add each phrase to an array
2. Get the first phrase from the array and remove it
3. Find the number of phrases that match it and remove those, keeping a count of matches
4. Push the phrase and the number of matches to a new array
5. Repeat until initial array is empty
I'm trash for formal CS, but I believe this is of n^2 complexity, specifically involving n(n-1)/2 comparisons in the worst case. I have no doubt there is some better way to do this, but you mentioned that efficiency is a non-issue, so this'll do.
Code follows (I used a new function to me, array_keys that accepts a search parameter):
// assign the source text to $text
$text = file_get_contents('mytext.txt');
// there are other ways to do this, like preg_match_all,
// but this is computationally the simplest
$phrases = explode('.', $text);
// filter the phrases
// if you're in PHP5, you can use a foreach loop here
$num_phrases = count($phrases);
for($i = 0; $i < $num_phrases; $i++) {
$phrases[$i] = trim($phrases[$i]);
}
$counts = array();
while(count($phrases) > 0) {
$p = array_shift($phrases);
$keys = array_keys($phrases, $p);
$c = count($keys);
$counts[$p] = $c + 1;
if($c > 0) {
foreach($keys as $key) {
unset($phrases[$key]);
}
}
}
print_r($counts);
View it in action: http://ideone.com/htDSC
I think you should go for
str_word_count
$str = "Hello friend, you're
looking good today!";
print_r(str_word_count($str, 1));
will give
Array
(
[0] => Hello
[1] => friend
[2] => you're
[3] => looking
[4] => good
[5] => today
)
Then you can use array_count_values()
$array = array(1, "hello", 1, "world", "hello");
print_r(array_count_values($array));
which will give you
Array
(
[1] => 2
[hello] => 2
[world] => 1
)
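Putting the two functions together for phrases rather than single words could look like the sketch below. It builds every run of 3 to 8 consecutive words and counts the repeats. The function name countPhrases, its parameters, and the sample text are assumptions made for illustration:

```php
<?php
// Count phrases of $minWords to $maxWords consecutive words that occur more than once.
function countPhrases(string $text, int $minWords = 3, int $maxWords = 8): array
{
    // Lowercase so "The" and "the" count as the same word, then split into words.
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);

    // Collect every run of n consecutive words, for each n in the range.
    $phrases = [];
    for ($n = $minWords; $n <= $maxWords; $n++) {
        for ($i = 0; $i + $n <= $total; $i++) {
            $phrases[] = implode(' ', array_slice($words, $i, $n));
        }
    }

    // Count each phrase and keep only those seen more than once.
    $counts = array_filter(array_count_values($phrases), function ($count) {
        return $count > 1;
    });
    arsort($counts);
    return $counts;
}

$text = "The quick brown fox jumps. Later the quick brown fox sleeps.";
$counts = countPhrases($text);
print_r($counts);
```

Because str_word_count drops punctuation, phrases in this sketch can span sentence boundaries. Splitting the text into sentences first would avoid that.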
An ugly solution, since you said ugly is ok, would be to search for the first word for any of your phrases. Then, once that word is found, check if the next word past it matches the next expected word in the phrase. This would be a loop that would keep going so long as the hits are positive until either a word is not present or the phrase is completed.
Simple, but exceedingly ugly and probably very, very slow.
Coming in late here, but since I stumbled upon this while looking to do a similar thing, I thought I'd share where I landed in 2019:
https://packagist.org/packages/yooper/php-text-analysis
This library made my task downright trivial. In my case, I had an array of search phrases that I wound up breaking up into single terms, normalizing, then creating two and three-word ngrams. Looping through the resulting ngrams, I was able to easily summarize the frequency of specific phrases.
$words = tokenize($searchPhraseText);
$words = normalize_tokens($words);
$ngram2 = array_unique(ngrams($words, 2));
$ngram3 = array_unique(ngrams($words, 3));
Really cool library with a lot to offer.
If you want fulltext search in html files, use Sphinx - powerful search server.
Documentation is here
