Search array for duplicates in PHP

It's been years since I've used PHP and I am more than a little rusty.
I am trying to write a quick script that will open a large file, split it into an array, and then look for similar occurrences in each value. For example, the file consists of something like this:
Chapter 1. The Beginning
Art. 1.1 The story of the apple
Art. 1.2 The story of the banana
Art. 1.3 The story of the pear
Chapter 2. The middle
Art. 1.1 The apple gets eaten
Art. 1.2 The banana gets split
Art. 1.3 Looks like the end for the pear!
Chapter 3. The End
…
I would like the script to tell me automatically that two of the values contain the string "apple" and return "Art. 1.1 The story of the apple" and "Art. 1.1 The apple gets eaten", and then do the same for the banana and the pear.
I am not looking to search the array for a specific string; I just need it to count occurrences and return what and where.
I already have the script opening the file and splitting it into an array. I just can't figure out how to find similar occurrences.
<?php
$file = fopen("./index.txt", "r");
$blah = array();
while (!feof($file)) {
$blah[] = fgets($file);
}
fclose($file);
var_dump($blah);
?>
Any help would be appreciated.

This solution is not perfect, as it counts every single word in the text, so you may have to modify it to better serve your needs. It does give accurate statistics on how many times each word is mentioned in the file and on exactly which rows.
$blah = file('./index.txt');
$stats = array();
foreach ($blah as $key => $row) {
    // split the row into words and strip surrounding whitespace
    $words = array_map('trim', explode(' ', $row));
    foreach ($words as $word) {
        if (empty($stats[$word])) {
            // first time this word is seen
            $stats[$word]['rows'] = $key.", ";
            $stats[$word]['count'] = 1;
        } else {
            $stats[$word]['rows'] .= $key.", ";
            $stats[$word]['count']++;
        }
    }
}
print_r($stats);
I hope this idea helps you get going, and that you can polish it further to suit your needs!
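To narrow the word statistics down to what the question asks for, one option is to ignore very short words and keep only the words that occur on more than one line. This is only a minimal sketch: the helper name findSharedWords() and the minimum word length of 5 are assumptions made for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: map each word to the line numbers it appears on,
// keeping only words found on at least two different lines.
function findSharedWords(array $lines, int $minLength = 5): array
{
    $rows = [];
    foreach ($lines as $lineNo => $line) {
        // Lower-case the line and split on anything that is not a letter.
        $words = preg_split('/[^a-z]+/', strtolower($line), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_unique($words) as $word) {
            if (strlen($word) >= $minLength) {
                $rows[$word][] = $lineNo;
            }
        }
    }
    // Keep only the words that occur on more than one line.
    return array_filter($rows, function ($lineNos) {
        return count($lineNos) > 1;
    });
}

$lines = [
    'Art. 1.1 The story of the apple',
    'Art. 1.2 The story of the banana',
    'Art. 1.1 The apple gets eaten',
    'Art. 1.2 The banana gets split',
];
$shared = findSharedWords($lines);
foreach ($shared as $word => $lineNos) {
    echo $word, ': lines ', implode(', ', $lineNos), PHP_EOL;
}
```

Note that a common word such as "story" is reported as well, so a stop-word list would probably be needed for real text.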

Related

I want to turn an array $badwords into a file to be called

I have a form that sends an email. I have a list of words to ban, and they are entered manually in an array. Each matching word adds a point, and enough points can eventually reject the send. I want to put the words into a file to load instead, because although this works, it's slow to update, especially across several domains. Sorry for my lack of skill. Thanks.
$badwords = array("word1", "word2", "word3");
foreach ($badwords as $word)
if (strpos(strtolower($_POST['comments']), $word) !== false
As the badwords add up, the point value increases up to a limit, which then rejects the send.
Excuse me, evidently I was not clear. I want to take the EXISTING array of badwords and put them in a file, in some sort of order and format (one per line, or comma separated?). I want that file to be read by the existing script.
So maybe it theoretically looks like :
$badwords = badwords.php and so on....
Thanks
I'm not sure if this is what you need, but try it.
This code should do what you need: it finds the words from the 'badwords' list in the 'message', counts the occurrences of each one, and adds 1 penalty point to '$penalty' for every match (duplicates included).
The code ignores case.
Set the list:
$badwords = ['world', 'car', 'cat', 'train',];
$message = "World is small. I love music and my car. But I also love to
travel by train. I like animals, especially my cat.";
We will initialize the variable for counting penalty points.
$penalty = 0;
Now we need to go through the 'message' once for each entry in the 'badwords' array. We will use a 'for' loop.
for($k =0; $k <= count($badwords) - 1; $k++):
preg_match_all("/$badwords[$k]/i", $message, $out[]);
endfor;
We have now passed through the message 4 times (indexes 0 to 3), once per bad word. Using a regular expression, we store the word matches in an 'out' array, which makes it a multidimensional array. Now we need to go through this 'out' array and reduce its dimensions.
foreach ($out as &$value):
$value = $value[0];
endforeach;
We now go through the 'out' array again with a 'for' loop and count the values in each entry. Every match, duplicates included, adds 1 penalty point.
for($n = 0; $n <= count($out)-1; $n++):
$penalty += count($out[$n]);
endfor;
The result is the number of points awarded.
Here is source of the php, on PHP Fiddle
http://phpfiddle.org/main/code/jzyw-hva6
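The same total can probably be reached without regular expressions. The following is a minimal sketch, assuming a hypothetical countPenalty() helper that is not part of the answer above. It uses substr_count() on lower-cased text, so like the regex version it also counts bad words that appear inside longer words.

```php
<?php
// Hypothetical helper: one penalty point per occurrence of any bad word.
function countPenalty(string $message, array $badwords): int
{
    $haystack = strtolower($message);
    $penalty = 0;
    foreach ($badwords as $word) {
        // substr_count() counts non-overlapping occurrences of the word.
        $penalty += substr_count($haystack, strtolower($word));
    }
    return $penalty;
}

$badwords = ['world', 'car', 'cat', 'train'];
$message = "World is small. I love music and my car. But I also love to travel by train. I like animals, especially my cat.";
$penalty = countPenalty($message, $badwords);
echo $penalty, PHP_EOL;
```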
In words.php:
<?php
$words = ["filter","these","words","out"];
In your main script:
<?php
include "words.php";
print_r($words);
Result:
Array
(
[0] => filter
[1] => these
[2] => words
[3] => out
)
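If the list should live in a plain text file rather than a PHP file, file() can read it with one word per line. This is a minimal sketch under assumptions: the helper name loadBadwords() and the one-word-per-line layout are illustrative, and the demo writes a temporary file (standing in for something like badwords.txt) so that it runs on its own.

```php
<?php
// Hypothetical helper: read one bad word per line from a text file.
function loadBadwords(string $path): array
{
    $words = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($words === false) {
        return []; // the file could not be read
    }
    // Trim stray whitespace and lower-case for case-insensitive matching.
    return array_map('strtolower', array_map('trim', $words));
}

// Demo with a temporary file standing in for the real word list.
$path = tempnam(sys_get_temp_dir(), 'bad');
file_put_contents($path, "Word1\nword2\n\nword3\n");
$badwords = loadBadwords($path);
unlink($path);
print_r($badwords);
```

A text file like this can be edited without touching any PHP, which may make it easier to share across several domains.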
figured it out.
in the root of my webspace I made a file called words.php
<?php
$badwords = array("000", "adult", etc
then added an include to my main file (as there are counted words, so there can be more than one)
include "../badwords.php"; // the array list was here
and went on to the foreach statement.
I also removed this original line from the main file:
$badwords = array("word1", "word2", "word3");
Seems to be working. Thanks

How can I remove duplicated lines in a file using PHP (including the "original" one)?

Well, my question is very simple, but I couldn't find a proper answer anywhere. What I need is a way to read a .txt file and, if a line is duplicated, remove ALL copies of it without preserving one. For example, if a .txt contains the following:
1234
1233
1232
1234
The output should be:
1233
1232
That is because the code has to delete every copy of a duplicated line. I searched all over the web, but the results always point to answers that remove duplicated lines and preserve one of them, like this, this or that.
I'm afraid that the only way to do this is to read line x and check it against the whole .txt; if an equal line is found, delete it and delete line x too, otherwise move on to the next line. But the .txt file I'm checking has 50 million lines (~900 MB), and I don't know how much memory this kind of task needs, so I'd appreciate some help here.
Read the file line by line, and use the line contents as the key of an associative array whose values are a count of the number of times the line appears. After you're done, write out all the lines whose value is only 1. This will require as much memory as all the unique lines.
$lines = array();
$fd = fopen("inputfile.txt", "r");
// compare against false so a line such as "0" does not end the loop early
while (($line = fgets($fd)) !== false) {
    $line = rtrim($line, "\r\n"); // ignore the newline
    if (array_key_exists($line, $lines)) {
        $lines[$line]++;
    } else {
        $lines[$line] = 1;
    }
}
fclose($fd);
$fd = fopen("outputfile.txt", "w");
foreach ($lines as $line => $count) {
    if ($count == 1) {
        fputs($fd, "$line" . PHP_EOL); // add the newlines back
    }
}
fclose($fd);
I doubt there is one and only one function that does all of what you want to do. So, this breaks it down into steps...
First, can we load a file directly into an array? See the documentation for the file() function. The FILE_IGNORE_NEW_LINES flag strips the line endings, so a last line without a trailing newline is still counted correctly.
$lines = file('mytextfile.txt', FILE_IGNORE_NEW_LINES);
Now I have all of the lines in an array. I want to count how many of each entry I have. See the documentation for the array_count_values() function.
$counts = array_count_values($lines);
Now I can easily loop through the array and delete any entries where the count is greater than 1.
foreach ($counts as $value => $cnt) {
    if ($cnt > 1) {
        unset($counts[$value]);
    }
}
Now I can turn the array keys (which are the line values) into an array.
$nondupes = array_keys($counts);
Finally, I can write the contents out to a file, adding the newlines back.
file_put_contents('myoutputfile.txt', implode(PHP_EOL, $nondupes) . PHP_EOL);
I think I have a far more elegant solution:
$array = array('1', '1', '2', '2', '3', '4'); // array with some unique values, some not unique
$array_count_result = array_count_values($array); // count value occurrences
$result = array_keys(array_filter($array_count_result, function ($value) { return ($value == 1); })); // filter and isolate only unique values
print_r($result);
gives:
Array
(
[0] => 3
[1] => 4
)
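Applied to the lines from the question, the same idea could be wrapped in a small function. This is a sketch only: linesOccurringOnce() is a hypothetical name, and in real use the input would presumably come from something like file('input.txt', FILE_IGNORE_NEW_LINES) rather than a literal array.

```php
<?php
// Hypothetical helper: keep only the lines that occur exactly once.
function linesOccurringOnce(array $lines): array
{
    $counts = array_count_values($lines);
    $once = array_filter($counts, function ($count) {
        return $count === 1;
    });
    // PHP converts integer-like array keys to int, so cast them back to strings.
    return array_map('strval', array_keys($once));
}

$lines = ['1234', '1233', '1232', '1234'];
$result = linesOccurringOnce($lines);
print_r($result);
```

Both copies of 1234 are dropped, leaving 1233 and 1232 as in the expected output of the question. Like the other array-based answers, this holds every distinct line in memory.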

PHP - Most Efficient Dictionary Code

I'm using the following code to pull the definition of a word from a tab-delimited file with only two columns (word, definition). Is this the most efficient code for what I'm trying to do?
<?php
$haystack = file("dictionary.txt");
$needle = 'apple';
$flipped_haystack = array_flip($haystack);
foreach($haystack as $value)
{
$haystack = explode("\t", $value);
if ($haystack[0] == $needle)
{
echo "Definition of $needle: $haystack[1]";
$defined = "1";
break;
}
}
if($defined != "1")
{
echo "$needle not found!";
}
?>
Right now you're doing a lot of pointless work
1) load the file into a per-line array
2) flip the array
3) iterate over and explode every value of the array
4) test that exploded value
You can't really avoid step 1, but why do you have to do all that useless "busy work" for 2&3?
e.g. if your dictionary text was set up something like this:
word:definition
then a simple:
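// build the pattern by concatenation: variables are not interpolated inside single quotes
$matches = preg_grep('/^' . preg_quote($needle, '/') . ':(.*)$/', $haystack);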
would do the trick for you, with far less code.
No. A trie would most likely be more efficient. Your dictionary isn't sorted, and you aren't using a binary or ternary search tree. If you need to search a huge dictionary, I'd guess your method is simply too slow.
Is this the most efficient code for what I'm trying to do?
Surely not.
To find only one needle you are processing all the entries.
I will be building up to have 100,000+ entries.
use a database then.
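Short of a database, the lookup can at least avoid loading and exploding the whole file. The following is a minimal sketch, assuming a hypothetical lookupDefinition() helper and the tab-delimited layout from the question. It reads line by line and stops at the first match, though it is still a linear scan in the worst case.

```php
<?php
// Hypothetical helper: return the definition of $needle, or null if absent.
function lookupDefinition(string $path, string $needle): ?string
{
    $fh = fopen($path, 'r');
    if ($fh === false) {
        return null; // the file could not be opened
    }
    $definition = null;
    while (($line = fgets($fh)) !== false) {
        // Each line is expected to be "word<TAB>definition".
        $parts = explode("\t", rtrim($line, "\r\n"), 2);
        if (count($parts) === 2 && $parts[0] === $needle) {
            $definition = $parts[1];
            break; // stop reading at the first match
        }
    }
    fclose($fh);
    return $definition;
}

// Demo with a temporary file standing in for dictionary.txt.
$path = tempnam(sys_get_temp_dir(), 'dict');
file_put_contents($path, "apple\ta fruit\nbanana\ta long yellow fruit\n");
$found = lookupDefinition($path, 'banana');
$missing = lookupDefinition($path, 'cherry');
unlink($path);
echo $found ?? 'not found', PHP_EOL;
```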

PHP Array Generator

I have some values in an Excel file and I want all of them to become array elements. Note that the file also has other data in it.
I know one way is to copy them one by one and put them into an array initialization statement.
A sample list which is just a part of whole list
Bay Area
Greater Boston
Greater Chicago
Greater Dallas-Ft. Worth
Greater D.C.
Las Vegas
Greater Houston
Greater LA
Greater New York
Greater San Diego
Seattle
South Florida
It is easy to initialize an array with values when there are not many items, like
$array= array('Bay Area '=>'Bay Area ','Greater Boston'=>'Greater Boston',....)
// and so on
But I have 70-80 items, so it is a very tedious task to initialize the array like above.
So, is there any alternative or shortcut for filling an array from the list of values?
Is there any automatic array generator tool?
If you copied them to a file with each one on its own line, you could read the file in PHP like this:
// FILE_IGNORE_NEW_LINES strips the trailing newline from each value
$myArray = file('myTextFile', FILE_IGNORE_NEW_LINES);
// sets the keys of the array equal to the values
$myArray = array_combine($myArray, $myArray);
You could export the data from excel to a CSV, then read the file into php like so:
$myArray = array();
if (($file = fopen("myTextFile", "r")) !== FALSE) {
    while (($data = fgetcsv($file)) !== FALSE) {
        // use every cell of the row as both key and value
        foreach ($data as $value) {
            $myArray[$value] = $value;
        }
    }
    fclose($file); // close the same handle that was opened above
}
$array = explode("\n", file_get_contents('yourfile.txt'));
For more complex cases for loading CSV files in PHP, maybe use fgetcsv() or even PHPExcelReader for XLS files.
EDIT (after question edit)
(Removed my poor solution, as ohmusama's file() + array_combine() is clearly nicer)
This one:
$string_var = "
Bay Area
Greater Boston
Greater Chicago
Greater Dallas-Ft. Worth
";
$array_var = explode("\n", $string_var);
Get Notepad++, open the Excel file there, and do a simple search and replace with regex. Something like search for "(.*)\n" and replace with "'(\1)'," (outer double quotes not included). This would give you a long list of:
'Bay Area','Greater Boston','Greater Chicago'
This would be the fastest way of creating the array in terms of php execution time.
I think this looks better:
$a[] = "Bay Area";
$a[] = "Greater Boston";
$a[] = "Greater Chicago";
To create such a text file, use Excel (I don't have Excel, but it would look something like this; note that Excel's function is CHAR, not chr):
=CONCATENATE(" $a[] = ",CHAR(34),A1,CHAR(34),";")
Then export only that column.
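As for an "array generator", PHP can write the initialization statement itself with var_export(). This is a minimal sketch, assuming a hypothetical generateArraySource() helper and a list pasted as one item per line. It prints PHP source that could then be copied into a script.

```php
<?php
// Hypothetical helper: turn a newline-separated list into PHP source
// for an array whose keys equal its values.
function generateArraySource(string $list, string $varName = 'array'): string
{
    // Trim each line (this also removes "\r") and drop empty lines.
    $items = array_filter(array_map('trim', explode("\n", $list)), 'strlen');
    $array = array_combine($items, $items);
    // var_export() with true returns valid PHP code as a string.
    return '$' . $varName . ' = ' . var_export($array, true) . ';';
}

$list = "Bay Area\nGreater Boston\nGreater Chicago\n";
$source = generateArraySource($list);
echo $source, PHP_EOL;
```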

Remove composed words

I have a list of words in which some are composed words, for example
palanca
plato
platopalanca
I need to remove "plato" and "palanca" and leave only "platopalanca".
I used array_unique to remove duplicates, but these composed words are tricky...
Should I sort the list by word length and compare one by one?
Is a regular expression the answer?
update: The list of words is much bigger and mixed, not only related words.
update 2: I can safely implode the array into a string.
update 3: I'm trying to avoid doing this like a bubble sort. There must be a more efficient way of doing this.
Well, I think that a bubble-sort-like approach is the only possible one :-(
I don't like it, but it's what I have...
Any better approach?
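// sort shortest first, so each word is only compared against longer ones
function sortByLengthAsc($a, $b) {
    return strlen($a) - strlen($b);
}
usort($words, 'sortByLengthAsc');
$count = count($words);
$delete = array();
for ($i = 0; $i < $count; $i++) {
    for ($j = $i + 1; $j < $count; $j++) {
        // is the shorter word contained in a longer one?
        if (strpos($words[$j], $words[$i]) !== false) {
            $delete[] = $i;
            break; // one match is enough to mark this word
        }
    }
}
foreach ($delete as $i) {
    unset($words[$i]);
}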
update 5: Sorry all, I'm a moron. Jonathan Swift made me realize I was asking the wrong question.
Given x words which START the same, I need to remove the shortest ones.
"hot, dog, stand, hotdogstand" should become "dog, stand, hotdogstand"
"car, pet, carpet" should become "pet, carpet"
"palanca, plato, platopalanca" should become "palanca, platopalanca"
"platoother, other" should be left untouched; they start differently
I think you need to define the problem a little more, so that we can give a solid answer. Here are some pathological lists. Which items should get removed?:
hot, dog, hotdogstand.
hot, dog, stand, hotdogstand
hot, dogs, stand, hotdogstand
SOME CODE
This code should be more efficient than the one you have:
$words = array('hatstand','hat','stand','hot','dog','cat','hotdogstand','catbasket');
$count = count($words);
for ($i=0; $i<$count; $i++) {
    if (isset($words[$i])) {
        $len_i = strlen($words[$i]);
        for ($j=$i+1; $j<$count; $j++) {
            if (isset($words[$j])) {
                $len_j = strlen($words[$j]);
                if ($len_i<=$len_j) {
                    // word i is the start of word j: drop word i
                    if (substr($words[$j],0,$len_i)==$words[$i]) {
                        unset($words[$i]);
                        break; // word i is gone, so stop comparing it
                    }
                } else {
                    // word j is the start of word i: drop word j
                    if (substr($words[$i],0,$len_j)==$words[$j]) {
                        unset($words[$j]);
                    }
                }
            }
        }
    }
}
foreach ($words as $word) {
    echo "$word<br>";
}
You could optimise this by storing word lengths in an array before the loops.
You can take each word and see, if any word in array starts with it or ends with it. If yes - this word should be removed (unset()).
You could put the words into an array, sort the array alphabetically and then loop through it checking if the next words start with the current index, thus being composed words. If they do, you can remove the word in the current index and the latter parts of the next words...
Something like this:
$array = array('palanca', 'plato', 'platopalanca');
// ok, the example array is already sorted alphabetically, but anyway...
sort($array);
// another array for words to be removed
$removearray = array();
$count = count($array);
// loop through the array, the last index won't have to be checked
for ($i = 0; $i < $count - 1; $i++) {
    $current = $array[$i];
    // use another loop in case there is more than one combined word
    // if the words are case sensitive, use strpos() instead to compare
    for ($j = $i + 1; $j < $count && stripos($array[$j], $current) === 0; $j++) {
        // the next word starts with the current one, so remove current
        $removearray[] = $current;
        // get the other word to remove
        $removearray[] = substr($array[$j], strlen($current));
    }
}
// now just get rid of the words to be removed
// for example by taking the difference of the two arrays
$result = array_values(array_diff($array, $removearray));
Regex could work. You can define within the regex where the start and end of the string applies.
^ defines the start
$ defines the end
so something like
foreach($array as $value)
{
//$term is the value that you want to remove
if(preg_match('/^' . preg_quote($term, '/') . '$/', $value)) // preg_quote escapes regex metacharacters in $term
{
//Here you can be confident that $term is $value, and then either remove it from
//$array, or you can add all not-matched values to a new result array
}
}
would avoid your issue
But if you are just checking that two values are equal, == will work just as well as (and possibly faster than) preg_match
In the event that the list of $terms and $values are huge this won't come out to be the most efficient of strategies, but it is a simple solution.
If performance is an issue, sorting (note the provided sort function) the lists and then iterating down the lists side by side might be more useful. I'm going to actually test that idea before I post the code here.
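For the updated requirement (remove any word that another word starts with), the sorting idea could be sketched with a single sorted list instead of two. This is an assumption-laden sketch, not the code promised above: removePrefixWords() is a hypothetical name, and it relies on the fact that in byte-wise sorted order a word sits directly before any longer word that starts with it, so only neighbours need comparing.

```php
<?php
// Hypothetical helper: drop every word that is the start of another word.
function removePrefixWords(array $words): array
{
    $words = array_values(array_unique($words));
    // Byte-wise order, so a prefix sorts directly before its extensions.
    sort($words, SORT_STRING);
    $kept = [];
    $count = count($words);
    for ($i = 0; $i < $count; $i++) {
        $word = $words[$i];
        // Only the next word needs checking, thanks to the sort order.
        $isPrefix = $i + 1 < $count
            && strncmp($words[$i + 1], $word, strlen($word)) === 0;
        if (!$isPrefix) {
            $kept[] = $word;
        }
    }
    return $kept;
}

print_r(removePrefixWords(['hot', 'dog', 'stand', 'hotdogstand']));
```

The result comes back in alphabetical order rather than the original order, and the work is dominated by the sort instead of comparing every pair of words.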
