Find intersection/frequency of a word in multiple files - php

<?php
$wordFrequencyArray = array();
function countWordsfrequency($filename) {
global $wordFrequencyArray;
$contentoffile = (file_get_contents($filename));
$wordArray = preg_split('/[^a-zA-Z0-9]/', $contentoffile, -1, PREG_SPLIT_NO_EMPTY);
foreach (array_count_values($wordArray) as $word => $count) {
if (!isset($wordFrequencyArray[$word])) $wordFrequencyArray[$word] = 0;
$wordFrequencyArray[$word] += $count;
}
}
$filenames = array('file1.txt', 'file2.txt','file3.txt','file4.txt');
foreach ($filenames as $filename) {
countWordsfrequency($filename);
}
print_r($wordFrequencyArray);
?>
This is my code to find the frequency of each word across multiple files and print it. Now I want to find the intersection, that is, which words occur in which files. For example, for the word "stack" I want to print the files it occurs in along with its frequency, which I think I have already calculated.
The final result should be the frequency followed by the files in which the word occurs.
How should I proceed? Should I check it in the foreach loop inside the countWordsfrequency function itself?

You will have to save a little more information. I am going to shy away from using classes because it seems like you do not need anything too robust.
<?php
$wordFrequencies = array();
function countWordsFrequency($filename) {
global $wordFrequencies;
$contentoffile = (file_get_contents($filename));
$wordArray = preg_split('/[^a-zA-Z0-9]/', $contentoffile, -1, PREG_SPLIT_NO_EMPTY);
foreach (array_count_values($wordArray) as $word => $count) {
// First time this word is seen in any file: create its record.
if (!isset($wordFrequencies[$word])) {
$wordFrequencies[$word] = array('total' => 0, 'files' => array());
}
// If this is the first occurrence of this word in the file, start a count.
if (!isset($wordFrequencies[$word]['files'][$filename])) {
$wordFrequencies[$word]['files'][$filename] = 0;
}
// Increment counts for both the total and the file.
// Write to $wordFrequencies directly: PHP arrays are copied on assignment,
// so updating a local copy would not change the global array.
$wordFrequencies[$word]['total'] += $count;
$wordFrequencies[$word]['files'][$filename] += $count;
}
}
$filenames = array('file1.txt', 'file2.txt','file3.txt','file4.txt');
foreach ($filenames as $filename) {
countWordsFrequency($filename);
}
print_r($wordFrequencies);
?>
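To get the output the question asks for (frequency followed by the files the word occurs in), the per-file counts can be walked afterwards. The following is only a minimal sketch: the `$files` array of in-memory texts is a made-up stand-in for `file_get_contents()` on real files, so that the example runs on its own.

```php
<?php
// Hypothetical in-memory "files" (assumption, not from the question),
// used so the sketch runs without touching the disk.
$files = [
    'file1.txt' => 'stack overflow stack',
    'file2.txt' => 'heap and stack',
    'file3.txt' => 'queue',
];

$wordFrequencies = [];
foreach ($files as $filename => $content) {
    $words = preg_split('/[^a-zA-Z0-9]+/', $content, -1, PREG_SPLIT_NO_EMPTY);
    foreach (array_count_values($words) as $word => $count) {
        // Running total across all files.
        $wordFrequencies[$word]['total'] = ($wordFrequencies[$word]['total'] ?? 0) + $count;
        // Count for this particular file.
        $wordFrequencies[$word]['files'][$filename] = $count;
    }
}

// Print "word: total (file, file, ...)" for every word.
foreach ($wordFrequencies as $word => $info) {
    echo $word, ': ', $info['total'], ' (', implode(', ', array_keys($info['files'])), ')', PHP_EOL;
}
```

With the sample texts above, the line for "stack" reads `stack: 3 (file1.txt, file2.txt)`. Words that occur in every file are those whose `files` entry has as many keys as there are input files.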

Related

looping through txt file to use specific part of a string

I am new to PHP and can't seem to figure this out no matter how much I've googled.
So I've opened the txt file, which consists of multiple lines of this type of string (the unique identifier, the IMEI, is the third comma-separated field, 863286020022449 here):
Rx:00:39:54 06/09/2015:+RESP:GTEPS,210101,863286020022449,,8296,01,1,3,0.0,0,1031.1,29.367950,-30.799161,20150906003710,,,,,,2857.9,20150906003710,8038$
There are different strings with different IMEIs, but I only want to use a specific one.
My question is, how do I extract/only use the string with the same Unique identifier and then loop through those to use in another function?
My function has different cases and each case has different calculations, so I'll need to loop through the txt file (with e.g. 863286020022449 as Identifier, ignoring other identifiers/IMEIs) in order to use the string in my function as below:
This is my starter function:
function GetParam($unknownFunction, $numberCommas) {
$returnString = "";
$foundSting = 0;
$numberFound = 0;
$len = strlen($unknownFunction);
for ($i = 0; $i < $len; ++$i) {
if ($unknownFunction[$i] == ",") {
++$numberFound;
if ($numberFound > $numberCommas)
break;
if ($numberFound == $numberCommas)
$foundSting = 1;
}
else if ($foundSting == 1) {
$returnString .= $unknownFunction[$i];
}
}
return $returnString;
echo $returnString;
}
$i = strpos($unknownFunction, ":GT");
$p = substr($unknownFunction, $i+3,3);
$Protocol = GetParam($unknownFunction, 1);
//this switch reads the differences in the message types (e.g. HBD- in this case is a heartbeat message type and would thus have a different amount of commas in the string and has different definitions of the characters within the commas)
switch ($p) {
case 'HBD':
//+ACK:GTHBD,220100,135790246811220,,20100214093254,11F0$
//This is an example of an HBD message
$result2["Type"] = 'Heart beat';
$IMEI = GetParam($unknownFunction, 2);
$mDate = GetParam($unknownFunction, 4);
$mDate = substr($mDate,0,4).'-'.substr($mDate,4,2).'-'.substr($mDate,6,2).' '.substr($mDate,8,2).':'.substr($mDate,10,2).':'.substr($mDate,12,2);
break;
This is the biggest problem I am facing at the moment and when I print the different lines, it indicates the correct IMEI but it does not loop through the whole file to use each string that belongs to that IMEI.
Your assistance would be greatly appreciated.
Thank you so much.
Example of input file:
Rx:00:00:00 28/02/2018:+RESP:GTFRI,3C0103,862045030241360,,14067,11,1,1,29.7,320,151.1,30.949307,-29.819685,20180227235959,0655,0001,013A,87B6,00,35484.1,01500:51:31,,,100,220101,,,,20180228000000,3461$
Rx:00:00:01 28/02/2018:+RESP:GTERI,380201,869606020047340,gv65,00000002,14076,10,1,1,119.0,119,24.3,18.668516,-34.016808,20180227235955,0655,0001,00F7,2DC9,00,98912.0,02235:20:25,0,100,220101,0,0,20180227235958,FF20$
Rx:00:00:03 28/02/2018:+RESP:GTERI,380201,869606020162990,,00000002,12912,10,1,1,0.0,230,1127.3,30.846671,-27.674206,20180227235956,0655,0001,013E,88B0,00,106651.1,03546:44:42,0,100,210101,0,0,20180227235959,6190$
Rx:00:00:03 28/02/2018:+ACK:GTHBD,450102,865084030005340,gb100,20180228000003,CC61$
Rx:00:00:03 28/02/2018:+RESP:GTERI,380201,869606020115980,,00000002,13640,10,1,1,12.1,353,1663.1,28.580726,-28.162208,20180227235957,,,,,,37599.6,02422:07:24,0,100,220101,0,0,20180228000000,1937$
Rx:00:00:04 28/02/2018:+RESP:GTERI,380502,869606020276840,gv65,00000002,12723,10,1,1,0.0,106,1232.8,22.878013,-27.951762,20180227235952,0655,0001,0204,63C5,00,13808.9,00778:32:20,0,100,210100,0,0,20180228000002,2C50$
Rx:00:00:04 28/02/2018:+RESP:GTERI,380502,869606020274530,gv65,00000002,12683,10,1,1,0.0,91,1213.7,24.863444,-28.174319,20180227235956,0655,0001,0203,69F1,00,9753.2,00673:49:21,0,100,210100,0,0,20180228000003,8AC7$
Rx:00:00:05 28/02/2018:+ACK:GTHBD,380201,863286023083810,,20180228000003,0D87$
Rx:00:00:06 28/02/2018:+RESP:GTFRI,3C0103,862045030241360,,14086,10,1,1,34.0,327,152.0,30.949152,-29.819501,20180228000002,0655,0001,013A,87B6,00,35484.1,01500:51:36,,,100,220101,,,,20180228000005,3462$
Rx:00:00:06 28/02/2018:+ACK:GTHBD,060228,862894021626380,,20180228000007,F9A5$
Rx:00:00:07 28/02/2018:+RESP:GTERI,380201,869606020019430,,00000002,12653,10,1,1,0.0,219,1338.7,26.882063,-28.138099,20180228000002,,,,,,86473.7,05645:48:34,0,93,210101,0,0,20180228000003,0FA5$
Rx:00:00:09 28/02/2018:+ACK:GTHBD,380502,869606020233940,gv65,20180228000008,7416$
Rx:00:00:10 28/02/2018:+RESP:GTAIS,380201,869606020171710,,11,11,1,1,0.0,95,281.2,30.855164,-29.896575,20180228000009,0655,0001,0156,9A9F,00,156073.7,20180228000008,F9A4$
Each GT message means something different, which is why I need to extract only one specific IMEI and pass each of its lines to my function, which breaks down what every value between the commas actually means. The end result needs to be populated in an Excel spreadsheet, but that's a different issue.
Use a nested foreach, keeping track of the IMEIs you've already gone through. Something like this:
<?php
$filename = 'info.txt';
$contents = file($filename);
// Track IMEIs that have already been processed.
// This must live outside the loop, otherwise it is reset on every line.
$doneAlreadyArray = array();
foreach ($contents as $line) {
$IMEI = GetParam($line, 2);
//check if already done the IMEI previously
if (in_array($IMEI, $doneAlreadyArray)) {
continue;
}
foreach ($contents as $IMEIline){
//matching IMEIs?
if (GetParam($IMEIline, 2) == $IMEI){
//run new function with entire $IMEIline
new_function($IMEIline);
}
}
//add IMEI to doneAlreadyArray
array_push($doneAlreadyArray,$IMEI);
}
?>
If I've understood your question right and you want to extract the strings (lines) with the same unique identifier, this may be useful as a starting point.
The example is very basic and uses data from your question:
<?php
// Read the file.
$filename = 'input.txt';
$file = file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
// Each item of $output will contain an array of lines:
$output = array();
foreach ($file as $row) {
$a = explode(',', $row);
$imei = $a[2];
if (!array_key_exists($imei, $output)) {
$output[$imei] = array();
}
$output[$imei][] = $row;
}
// Then do what you want ...
foreach ($output as $key=>$value) {
echo 'IMEI: '.$key.'<br>';
foreach($value as $row) {
// Here you can call your functions. I just echo the row:
echo $row.'<br>';
}
}
?>
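If only one IMEI is of interest, the same `explode()` idea can be used to filter while reading, which also replaces the character-by-character `GetParam()` lookup with array indexing. This is a rough sketch, not a drop-in solution: the `$lines` array holds three heartbeat lines copied from the question instead of reading a file, and `$wantedImei` is an assumed variable name.

```php
<?php
// Three heartbeat (HBD) lines taken from the question's sample input.
// In real use these would come from file($filename, FILE_IGNORE_NEW_LINES).
$lines = [
    'Rx:00:00:03 28/02/2018:+ACK:GTHBD,450102,865084030005340,gb100,20180228000003,CC61$',
    'Rx:00:00:05 28/02/2018:+ACK:GTHBD,380201,863286023083810,,20180228000003,0D87$',
    'Rx:00:00:06 28/02/2018:+ACK:GTHBD,060228,862894021626380,,20180228000007,F9A5$',
];
$wantedImei = '863286023083810'; // assumed name for the IMEI to keep

// Keep only the lines whose third comma-separated field is the wanted IMEI.
$matching = [];
foreach ($lines as $line) {
    $fields = explode(',', $line);
    if (isset($fields[2]) && $fields[2] === $wantedImei) {
        $matching[] = $fields;
    }
}

// Each kept line is already split, so field N is simply $fields[N].
$type = '';
foreach ($matching as $fields) {
    // The message type is the three characters after ":GT" in the first field.
    $pos = strpos($fields[0], ':GT');
    $type = ($pos === false) ? '' : substr($fields[0], $pos + 3, 3);
    echo $type, ' ', $fields[2], ' ', $fields[4], PHP_EOL;
}
```

Here `$fields[4]` corresponds to `GetParam($line, 4)` in the question, which for an HBD message is the timestamp.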
thank you for the feedback.
Ryan Dewberry ended up helping me.
The fix was simpler than I thought too :)
//Unknownfunction is now $line
function GetParam($line, $numberCommas) {
$returnString = "";
$foundSting = 0;
$numberFound = 0;
$len = strlen($line);
for ($i = 0; $i < $len; ++$i) {
if ($line[$i] == ",") {
++$numberFound;
if ($numberFound > $numberCommas)
break;
if ($numberFound == $numberCommas)
$foundSting = 1;
}
else if ($foundSting == 1) {
$returnString .= $line[$i];
}
}
return $returnString;
// print $returnString;
}
//this is new - makes sure I use the correct IMEI
$contents = file($fileName);
foreach ($contents as $line){
$haveData = 0;
$IMEI = GetParam($line, 2);
if ($IMEI == $gprsid){
$i = strpos($line, ":GT");
$p = substr($line, $i+3,3);
$Protocol = GetParam($line, 1);
//this is the part I struggled with as well - This is an array of all of my
//calculation
//results and in printing it out I can see that everything is working
$superResult = array();
array_push($superResult,$result2);
print_r($superResult);
}
}
Much appreciated. Thank you!

php foreach preg_match'd line, get next lines

I hope the title is self explanatory.
I would like to loop over a xml file line by line, then match a particular line (getting attributes from that line), then get the next X lines after that line.
I have the following code, which attempts to do this, but I can't seem to figure out how to get the next X lines after the match.
$file = 'Electric.xml';
$lines = file($file);//file in to an array
foreach($lines as $line){
$reads = element_attributes('WINDOW',$line);
if($reads['class'] == 'Bracelets'){
print_r($reads);
}
if($reads['class'] == 'Handbags'){
print_r($reads);
}
}
function element_attributes($element_name, $xml) {
if ($xml == false) {
return false;
}
// Grab the string of attributes inside an element tag.
$found = preg_match('#<'.$element_name.
'\s+([^>]+(?:"|\'))\s?/?>#',
$xml, $matches);
if ($found == 1) {
$attribute_array = array();
$attribute_string = $matches[1];
// Match attribute-name attribute-value pairs.
$found = preg_match_all(
'#([^\s=]+)\s*=\s*(\'[^<\']*\'|"[^<"]*")#',
$attribute_string, $matches, PREG_SET_ORDER);
if ($found != 0) {
// Create an associative array that matches attribute
// names to attribute values.
foreach ($matches as $attribute) {
$attribute_array[$attribute[1]] =
substr($attribute[2], 1, -1);
}
return $attribute_array;
}
}
// Attributes either weren't found, or couldn't be extracted
// by the regular expression.
return false;
}
Use a proper parser, like SimpleXML, to parse the file. Then your issue becomes trivial. The PHP manual contains a tutorial to help you get started.
In this case you just loop over the lines, checking the property of the tag you're looking for, until you find a match. Then, loop over the next # elements, saving them into an array.
Something like this:
$xml = simplexml_load_file("file.xml");
foreach ($xml->node->element as $element) {
if ($element->attribute != "match") {
continue;
}
// If we get here we want to save the next # lines/elements.
}
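To make the SimpleXML suggestion a bit more concrete, here is a small sketch. The XML layout is an assumption (the question never shows the structure of Electric.xml): a flat list of `WINDOW` elements, each followed by hypothetical `ITEM` elements. The variable `$x` stands for the number of following elements to collect.

```php
<?php
// Hypothetical XML; the real structure of Electric.xml is not shown in the question.
$xmlString = <<<XML
<ROOT>
  <WINDOW class="Bracelets" id="1"/>
  <ITEM name="a"/>
  <ITEM name="b"/>
  <ITEM name="c"/>
  <WINDOW class="Handbags" id="2"/>
  <ITEM name="d"/>
</ROOT>
XML;

$root = simplexml_load_string($xmlString);
$wanted = ['Bracelets', 'Handbags'];
$x = 2; // how many following elements to collect after each match

// Copy the child elements into a plain array so they can be sliced by position.
$children = [];
foreach ($root->children() as $child) {
    $children[] = $child;
}

$collected = [];
foreach ($children as $i => $child) {
    if ($child->getName() === 'WINDOW' && in_array((string) $child['class'], $wanted, true)) {
        // The next $x elements after the matched one (fewer if the document ends).
        $collected[(string) $child['class']] = array_slice($children, $i + 1, $x);
    }
}

foreach ($collected as $class => $elements) {
    echo $class, ':';
    foreach ($elements as $element) {
        echo ' ', $element->getName(), '[', $element['name'], ']';
    }
    echo PHP_EOL;
}
```

Because the parser works on elements rather than text lines, this keeps working even if the file is re-wrapped or an element spans several lines.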
// $lines comes from file($file); $X is the number of lines to collect after each match.
$linesLength = count($lines);
$XLines = array();
for($index = 0; $index < $linesLength; $index++){
$reads = element_attributes('WINDOW',$lines[$index]);
if($reads['class'] == 'Bracelets'){
print_r($reads);
// Take the X lines that follow the matched line, then skip past them.
$XLines[] = array_slice($lines, $index + 1, $X);
$index += $X;
}
if($reads['class'] == 'Handbags'){
print_r($reads);
$XLines[] = array_slice($lines, $index + 1, $X);
$index += $X;
}
}

Removing an array from a PHP JSON object

A bit of background: I'm creating a web app, and I currently use roughly 50 arrays that I get from an API. I've created a script to find the arrays that I don't need (let's call them "bad arrays"), but I'm unsure how to filter them out with the method I'm using to search through them.
I'm searching through them with this script
$tagItems = [];
foreach($tags['items'] as $item) {
if (!$item['snippet']['tags'] || !is_array($item['snippet']['tags'])) {
continue;
}
foreach($item['snippet']['tags'] as $tag) {
$tag = strtolower($tag);
if (!isset($tagItems[$tag])) {
$tagItems[$tag] = 0;
}
$tagItems[$tag]++;
}
}
But let's say I didn't want it to include the 8th array and the 15th array
$tags['items'][8]['snippet']['tags'];
$tags['items'][15]['snippet']['tags'];
I want these to be removed from the original $tags array. How can I achieve this?
EDIT: This needs to be dynamic. I do not know whether 45 of the 50 arrays will need removing or just 2. The array that needs removing can be referred to as $index.
I have a script which determines what array(s) need to be removed
$i = 0;
while ($i <= 50) {
$x = 0;
while ($x <= 50) {
if ($tags['items'][$i]['snippet']['channelId'] == $tags['items'][$x]['snippet']['channelId']) {
if ($x < $i) {
break;
} else {
echo $x.", ";
break;
}
}
$x++;
}
$i++;
}
I'm going to edit this a little more to provide some extra information that may be useful. My overall goal is to use the YouTube API to remove all but the first array of tags where the channel id appears multiple times. I'm using a script which finds all the array numbers that dont need to be removed an URL.
You can check for the array key
$tagItems = [];
$toRemove = array(8,15);
foreach($tags['items'] as $key => $item) {
if(in_array($key,$toRemove)){
continue;
}
if (!$item['snippet']['tags'] || !is_array($item['snippet']['tags'])) {
continue;
}
foreach($item['snippet']['tags'] as $tag) {
$tag = strtolower($tag);
if (!isset($tagItems[$tag])) {
$tagItems[$tag] = 0;
}
$tagItems[$tag]++;
}
}
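Since the question's edit asks for something dynamic, the list of indexes to drop can be built while looping instead of being hard-coded. The sketch below assumes the stated goal (keep only the first item per channel ID) and uses a tiny made-up `$tags` array shaped like the one in the question; `$seenChannels` is an illustrative name.

```php
<?php
// Hypothetical sample shaped like the question's $tags array.
$tags = ['items' => [
    ['snippet' => ['channelId' => 'A', 'tags' => ['Foo', 'bar']]],
    ['snippet' => ['channelId' => 'B', 'tags' => ['bar']]],
    ['snippet' => ['channelId' => 'A', 'tags' => ['baz']]],
]];

$seenChannels = []; // channel IDs encountered so far
$toRemove = [];     // indexes of items whose channel ID was already seen

foreach ($tags['items'] as $index => $item) {
    $channelId = $item['snippet']['channelId'] ?? null;
    if ($channelId === null) {
        continue; // no channel ID: leave the item alone
    }
    if (isset($seenChannels[$channelId])) {
        $toRemove[] = $index; // duplicate channel, mark for removal
    } else {
        $seenChannels[$channelId] = true; // first occurrence, keep it
    }
}

// Remove the marked items from the original array and re-index it.
foreach ($toRemove as $index) {
    unset($tags['items'][$index]);
}
$tags['items'] = array_values($tags['items']);

print_r($toRemove);
print_r($tags['items']);
```

Using the channel ID as an array key makes the duplicate check a single `isset()` per item, so the nested `while` loops from the question are not needed.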

php match string to multiple array of keywords

I'm writing a basic categorization tool that will take a title and then compare it to an array of keywords. Example:
$cat['dining'] = array('food','restaurant','brunch','meal','cand(y|ies)');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
Are there creative ways to loop through these categories or to see which category has the most matches? Note that in the 'dining' array, I have regex to match variations on the word candy. I tried the following, but with these category lists getting pretty long, I'm wondering if this is the best way:
$keywordRegex = implode("|",$cat['dining']);
preg_match_all("/\b({$keywordRegex})\b/i",$string,$matches);
Thanks,
Steve
EDIT:
Thanks to @jmathai, I was able to add ranking:
$matches = array();
foreach($keywords as $k => $v) {
str_replace($v, '#####', $masterString,$count);
if($count > 0){
$matches[$k] = $count;
}
}
arsort($matches);
This can be done with a single loop.
I would split candy and candies into separate entries for efficiency. A clever trick would be to replace matches with some token. Let's use 10 #'s.
$cat['dining'] = array('food','restaurant','brunch','meal','candy','candies');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
$max = array(null, 0); // category, occurrences
foreach($cat as $k => $v) {
$replaced = str_replace($v, '##########', $string);
preg_match_all('/##########/i', $replaced, $matches);
if(count($matches[0]) > $max[1]) {
$max[0] = $k;
$max[1] = count($matches[0]);
}
}
echo "Category {$max[0]} has the most ({$max[1]}) matches.\n";
$cat['dining'] = array('food','restaurant','brunch','meal');
$cat['services'] = array('service','cleaners','framing','printing');
$string = 'Dinner at seafood restaurant';
$string = explode(' ',$string);
foreach ($cat as $key => $val) {
$kwdMatches[$key] = count(array_intersect($string,$val));
}
arsort($kwdMatches);
echo "<pre>";
print_r($kwdMatches);
Provided the number of words is not too great, creating a reverse lookup table and then running the title against it might be an idea.
// One-time reverse category creation
$reverseCat = array();
foreach ($cat as $cCategory => $cWordList) {
foreach ($cWordList as $cWord) {
if (!array_key_exists($cWord, $reverseCat)) {
$reverseCat[$cWord] = array($cCategory);
} else if (!in_array($cCategory, $reverseCat[$cWord])) {
$reverseCat[$cWord][] = $cCategory;
}
}
}
// Processing a title
$stringWords = preg_split("/\b/", $string);
$matchingCategories = array();
foreach ($stringWords as $cWord) {
if (array_key_exists($cWord, $reverseCat)) {
$matchingCategories = array_merge($matchingCategories, $reverseCat[$cWord]);
}
}
$matchingCategories = array_unique($matchingCategories);
You are performing an O(n*m) lookup, with n being the size of your categories and m being the size of a title. You could try organizing them like this:
const DINING = 0;
const SERVICES = 1;
$categories = array(
"food" => DINING,
"restaurant" => DINING,
"service" => SERVICES,
);
Then for each word in a title, check $categories[$word] to find the category - this gets you O(m).
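As a rough illustration of that lookup, the sketch below scores a title against a keyword-to-category map. It uses plain strings as category names rather than constants, handles exact word matches only (so "seafood" does not count as "food", and regex keywords like cand(y|ies) are not supported), and the variable names are made up for the example.

```php
<?php
// Keyword => category map (exact matches only; category names are plain strings here).
$categories = [
    'food'       => 'dining',
    'restaurant' => 'dining',
    'brunch'     => 'dining',
    'meal'       => 'dining',
    'service'    => 'services',
    'cleaners'   => 'services',
    'framing'    => 'services',
    'printing'   => 'services',
];

$title = 'Dinner at seafood restaurant';

// Lower-case the title and split it into words.
$words = preg_split('/[^a-z0-9]+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);

// One array lookup per word in the title.
$scores = [];
foreach ($words as $word) {
    if (isset($categories[$word])) {
        $category = $categories[$word];
        $scores[$category] = ($scores[$category] ?? 0) + 1;
    }
}

arsort($scores); // highest-scoring category first
print_r($scores);
```

For the sample title only "restaurant" matches, so `$scores` ends up as a single entry giving 'dining' a count of 1.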
Okay here's my new answer that lets you use regex in $cat[n] values...there's only one caveat about this code that I can't figure out...for some reason, it fails if you have any kind of metacharacter or character class at the beginning of your $cat[n] value.
Example: .*food will not work. But s.afood or sea.* etc... or your example of cand(y|ies) will work. I sort of figured this would be good enough for you since I figured the point of the regex was to handle different tenses of words, and the beginnings of words rarely change in that case.
function rMatch ($a,$b) {
if (preg_match('~^'.$b.'$~i',$a)) return 0;
if ($a>$b) return 1;
return -1;
}
$string = explode(' ',$string);
foreach ($cat as $key => $val) {
$kwdMatches[$key] = count(array_uintersect($string,$val,'rMatch'));
}
arsort($kwdMatches);
echo "<pre>";
print_r($kwdMatches);

paragraph comparison in PHP

I was wondering: let's say I have a webpage that crawls articles from the web, and all I get is the title and the article in plain text. Is there a PHP script or web service that can relate articles to each other? Or is there a PHP script that can generate keywords from a paragraph?
I have tested a script in Java that works, but maybe there's a PHP class somewhere that can help.
Thanks!
The functions from this answer can be used to extract words from text and compare them against each other. Rough example:
// For better results grab the texts manually and paste them here.
$nyt = file_get_contents('http://www.nytimes.com/2011/01/19/technology/19apple.html?pagewanted=print');
$sfc = file_get_contents('http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2011/01/19/BUAK1HARUL.DTL&type=business');
$nyt = strip_tags($nyt);
$sfc = strip_tags($sfc);
// stopwords from english snowball porter stemmer
$stopwordsFile = dirname(__FILE__).'/includes/stopwords_en.txt';
if (file_exists($stopwordsFile)) {
$stopwords = file($stopwordsFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
} else {
$stopwords = array();
}
$nytWords = extractWords($nyt, 3, $stopwords);
$sfcWords = extractWords($sfc, 3, $stopwords);
$nyt2sfcCount = countKeywords($nytWords, $sfcWords, 4);
$sfc2nytCount = countKeywords($sfcWords, $nytWords, 4);
// absolute
print_r($nyt2sfcCount);
print_r($sfc2nytCount);
$nyt2sfcFactor = strlen($sfc) / strlen($nyt);
$sfc2nytFactor = strlen($nyt) / strlen($sfc);
print($nyt2sfcFactor . PHP_EOL);
print($sfc2nytFactor . PHP_EOL);
foreach ($nyt2sfcCount as $word => $count) {
$nyt2sfcCountRel[$word] = $count * $nyt2sfcFactor;
}
foreach ($sfc2nytCount as $word => $count) {
$sfc2nytCountRel[$word] = $count * $sfc2nytFactor;
}
// relative
print_r($nyt2sfcCountRel);
print_r($sfc2nytCount);
print_r($nyt2sfcCount);
print_r($sfc2nytCountRel);
// reduce
$nyt2sfcCountRed = array_intersect_key($nyt2sfcCount, $sfc2nytCount);
$sfc2nytCountRed = array_intersect_key($sfc2nytCount, $nyt2sfcCount);
// reduced absolute
print_r($nyt2sfcCountRed);
print_r($sfc2nytCountRed);
foreach ($nyt2sfcCountRed as $word => $count) {
$nyt2sfcCountRedRel[$word] = $count * $nyt2sfcFactor;
}
foreach ($sfc2nytCountRed as $word => $count) {
$sfc2nytCountRedRel[$word] = $count * $sfc2nytFactor;
}
// reduced relative
print_r($nyt2sfcCountRedRel);
print_r($sfc2nytCountRed);
print_r($nyt2sfcCountRed);
print_r($sfc2nytCountRedRel);
