PHP & word counting from string

I'm trying to take numbers from a text file and count how many times each one occurs.
I've gotten to the point where I can print all of them out, but I want to display each number just once, followed by its number of occurrences (e.g. Key | Amount; 317 | 42).
Not looking for an Answer per se, all learning is good, but if you figure one out for me, that would be awesome as well!

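preg_match_all() returns the number of matches found in a string. Run $key through preg_quote() first if it may contain regex metacharacters:
$count = preg_match_all('#' . preg_quote($key, '#') . '#', $string);
print "{$key} - {$count}";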

So if you're already extracting the data you need, you can do this using a (fairly) simple array:
$counts = array();
foreach ($keysFoundFromFile as $key) {
    // isset() avoids an "undefined index" notice the first time a key is seen
    if (!isset($counts[$key])) $counts[$key] = 0;
    $counts[$key]++;
}
print_r($counts);
If you're already looping to extract the keys from the file, then you can simply assign them directly to the $counts array without making a second loop.

I think you're looking for the function substr_count().
Just a heads up though: if you're looking for "123" and the text contains "4512367", that will count as a match too. The alternative is a regular expression with word boundaries:
$count = preg_match_all('/\b' . preg_quote($num, '/') . '\b/', $text);
(preg_quote() for good practice, \b for word boundaries so we can be sure the number isn't embedded in another number.)
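If the numbers are not known in advance, one way to get the "Key | Amount" listing from the question is to extract every number first and tally them in one step. This is only a sketch: the sample text is made up, and in practice $text would probably come from something like file_get_contents().

```php
<?php
// Hypothetical sample text standing in for the contents of the file.
$text = "317 42 317 8 42 317";

// Pull out every run of digits, then tally how often each one occurs.
preg_match_all('/\d+/', $text, $matches);
$counts = array_count_values($matches[0]);

// Print each number once, followed by its number of occurrences.
foreach ($counts as $key => $amount) {
    echo "{$key} | {$amount}\n";
}
```

array_count_values() does the same job as the manual $counts loop, so it is mostly a matter of taste which one you use.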

Related

PHP finding "words" within a string

I need to compare 2 lists of strings against each other and output the strings which contain any of the strings searched for. It should be very easy, I just can't figure it out.
To oversimplify, let's use arrays. In reality I am accessing an API with SOAP and running it against my own list contained in a table, but the comparison is what I'm having trouble with.
Hit the submit button on listsearch.php and it executes.
ARRAY Mylist: TED, DEAD, FIRST, LAST, PUPPY
ARRAY TheirList: teddybearnoose, hauntedhouse, hehasdeparted, deadmouse, walkingdead, thegratefuldead, firstkiss, thinkfirst, firsttobelast, firstmanonthemoon, firstreattempted, somecrap, something, notdisplayed, 50000otherwords, miscjunk
outputs as:
TEDdybearnoose
haunTEDhouse
hehasdeparTED
DEADmouse
walkingDEAD
thegratefulDEAD
FIRSTkiss
thinkFIRST
FIRSTtobeLAST <--- note
FIRSTmanonthemoon
FIRSTreattempTED <--- note
It only outputs strings which contain a string from my list, in any position. The CAPS are just there to make the words stand out; they are not important.
Now, part 2:
Same "TheirList", except I type a keyword into a text area and select from a dropdown whether I want it at the beginning, at the end, or anywhere.
keywordsearch.php
search for: [ TED ] at: [beginning / end / anywhere] of string.
How would you make that one work?
Thanks in advance. This should be a breeze for most of you. I appreciate it, and I'll try to answer questions promptly.

You can use strpos() to find the position of a substring (docs).
Together with strrpos(), it makes it easy to check whether the substring occurs at the beginning or at the end of the string:
// String contains substring
strpos($string, $substring) !== false;
// String starts with substring
strpos($string, $substring) === 0;
// String ends with substring (strrpos() finds the last occurrence, so "tedted" still ends with "ted")
strrpos($string, $substring) === strlen($string) - strlen($substring);
On PHP 8 and later, str_contains(), str_starts_with() and str_ends_with() perform the same checks directly.
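Putting those checks together for both parts of the question might look like the sketch below. The helper name matches_at() and the mode strings 'beginning', 'end' and 'anywhere' are assumptions made for illustration (the mode strings mirror the dropdown in the question), and the case-insensitive stripos()/strripos() variants are used because the keywords are upper case while the list is lower case.

```php
<?php
// Hypothetical helper: does $haystack contain $needle at the requested position?
// $where is 'beginning', 'end' or anything else for "anywhere".
function matches_at($haystack, $needle, $where)
{
    $pos = stripos($haystack, $needle); // case-insensitive first occurrence
    if ($pos === false) {
        return false;
    }
    switch ($where) {
        case 'beginning':
            return $pos === 0;
        case 'end':
            // Use the last occurrence so that repeated substrings are handled.
            return strripos($haystack, $needle) === strlen($haystack) - strlen($needle);
        default:
            return true;
    }
}

$myList    = array('TED', 'DEAD', 'FIRST', 'LAST', 'PUPPY');
$theirList = array('teddybearnoose', 'hauntedhouse', 'deadmouse', 'thinkfirst', 'somecrap');

// Part 1: keep every string that contains any keyword, in any position.
$hits = array();
foreach ($theirList as $candidate) {
    foreach ($myList as $keyword) {
        if (matches_at($candidate, $keyword, 'anywhere')) {
            $hits[] = $candidate;
            break; // one match is enough; avoids listing the string twice
        }
    }
}
```

For part 2, the same helper can be called with the submitted keyword and the dropdown value, e.g. matches_at($candidate, 'TED', 'end').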

Read corresponding value in PHP and add to running sum

I would like to have each word in a string cross-referenced in a file.
So, if I was given the string: Jumping jacks wake me up in the morning.
I use some regex to strip out the period. Also, the entire string is made lowercase.
I then go on to have the words separated into an array by using PHP's nifty explode() function.
Now, what I'm left with, is an array with the words used in the string.
From there I need to look up each value in the array and get a value for it and add it to a running sum. for() loop it is. Okay, this is where I get stuck...
The list ($wordlist) is structured like so:
wake#4 waking#3 0.125
morning#2 -0.125
There are \ts in between the word and the number. There can be more than one word per value.
What I need the PHP to do now is look up the number to each word in the array then pull that corresponding number back to add it to a running sum. What's the best way for me to go about this?
The answer should be easy enough, just finding the location of the string in the wordlist and then finding the tab and from there reading the int... I just need some guidance.
Thanks in advance.
EDIT: to clarify -- I don't want the sum of the values of the wordlist, rather, I'd like to look up my individual values as they correspond to the words in the sentence and THEN look them up in the list and add just those values; not all of them.

Edited answer based on your comment and question edit. The running sum is stored in an array called $sum, where the key is the word and the value is its running sum, e.g. $sum['wake'] stores the running sum for the word wake, and so on.
$word_values = array();
foreach ($wordlist as $line) // Loop through each line of the word list
{
    // Each line holds one or more "word#n" entries followed by the number value,
    // e.g. "wake#4 waking#3 0.125". Split the line on whitespace (tabs or spaces).
    $fields = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    if (count($fields) < 2) continue; // Skip blank or malformed lines
    // The last field is the number value shared by every word on the line
    $value = (float) array_pop($fields);
    foreach ($fields as $entry)
    {
        // Keep only the word in front of '#', so "wake#4" becomes "wake"
        $pieces = explode('#', $entry);
        $word_values[$pieces[0]] = $value;
    }
}
// Assuming $sentence_array stores the words used in your string, for example {"jumping", "jacks", "wake", "me", "up", "in", "the", "morning"}
$sum = array();
foreach ($sentence_array as $word)
{
    if (!isset($word_values[$word])) continue; // Word is not in the word list
    if (!isset($sum[$word])) $sum[$word] = 0;
    $sum[$word] += $word_values[$word];
}
// array_sum($sum) gives the overall total for the sentence
I am assuming you will take care of case sensitivity, since you mentioned that you make the entire string lowercase, so I am not including that here.
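For a single overall total rather than a per-word array, the same lookup idea can be wrapped in two small functions. This is a sketch under a few assumptions: the function names build_lookup() and score_sentence() are made up, the word list is taken to be an array of lines in the format shown in the question, and str_word_count() is used to split the sentence (it also drops the trailing period).

```php
<?php
// Hypothetical word list in the format from the question: one or more
// "word#n" entries, then the value, separated by tabs.
$wordlist = array(
    "wake#4\twaking#3\t0.125",
    "morning#2\t-0.125",
);

// Build a word => value lookup table once.
function build_lookup(array $lines)
{
    $lookup = array();
    foreach ($lines as $line) {
        $fields = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
        if (count($fields) < 2) {
            continue; // blank or malformed line
        }
        $value = (float) array_pop($fields); // last field is the number
        foreach ($fields as $entry) {
            $pieces = explode('#', $entry);  // "wake#4" -> "wake"
            $lookup[strtolower($pieces[0])] = $value;
        }
    }
    return $lookup;
}

// Add up the values of only those words that appear in the sentence.
function score_sentence($sentence, array $lookup)
{
    $total = 0.0;
    foreach (str_word_count(strtolower($sentence), 1) as $word) {
        if (isset($lookup[$word])) {
            $total += $lookup[$word];
        }
    }
    return $total;
}

$lookup = build_lookup($wordlist);
$total  = score_sentence('Jumping jacks wake me up in the morning.', $lookup);
```

Building the lookup table once means each word in the sentence costs a single array access instead of a scan through the word list.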

$sentence = 'Jumping jacks wake me up in the morning';
$words = array();
foreach (explode(' ', $sentence) as $w) {
    if (array_key_exists($w, $words)) {
        $words[$w]++;
    } else {
        $words[$w] = 1;
    }
}
Explode by space and check whether the word is already a key in the $words array; if so, increment its count (value); if not, set its value to 1. Loop this for each of your sentences without redeclaring $words = array().

I need a PHP regular expression to validate string format of 5 digits, one comma

I have a huge input box on a webpage. This input should only take 5-digit strings separated by commas:
00100,00247,90277,97030,00657
Notice the last one has no comma at the end.
Is there a regular expression that can do this? Since the input box is very large and can take 100+ of these items, I want to validate it on the PHP server side before the database is queried, and thus avoid any SQL injection attempts.
The query should only run if the input consists solely of 5-digit numbers, each followed by a comma except for the last one.
These are a state's public water system ID's by the way.

I believe this will get the result you're looking for, though explode may be the better option.
/^(?:\d{5},)*\d{5}$/
This will only match 1 or more 5-digit numbers that are comma delimited with no spaces.

Since this is user submitted data, your validation should be more flexible. What if the user accidentally puts a space after one of the commas? Or a line break gets inserted?
I realize you are looking for a regex solution but may I suggest using explode to create an array and apply a rule to each element. Having them separated into elements allows more flexibility when validating and storing:
$nums = explode(',', '00100,00247,90277,97030,00657');
foreach ($nums as $num) {
    // Exactly five digits once surrounding whitespace is trimmed
    if (!preg_match('/^\d{5}$/', trim($num))) {
        // error!
    }
}

I'd explode it and validate each string individually:
$input = '00100,00247,90277,97030,00657';
$input_array = explode(',', $input);
$is_valid = true;
foreach ($input_array as $number) {
    // Each trimmed element must consist of exactly five digits
    if (!preg_match('/^\d{5}$/', trim($number))) {
        $is_valid = false;
        break;
    }
}
var_dump($is_valid);

I think you rather need str_getcsv(). It parses a string; if you are reading from a file handle such as $fp, use fgetcsv() instead:
$row = str_getcsv($input);
// $row is an array containing your digit strings

Simple. This regex matches a value having one or more comma separated 5-digit numbers:
if (preg_match('/^\d{5}(\s*,\s*\d{5})*$/', $value)) {
// Good value
}
It allows whitespace between the numbers as well.
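A possible way to combine the two approaches is to validate the whole value with the regex first and only then split it. The function name parse_ids() is hypothetical, and returning null for bad input is just one convention; the pattern uses \A and \z instead of ^ and $ so that a trailing newline is not silently accepted.

```php
<?php
// Hypothetical helper: returns the list of 5-digit IDs, or null if the input is malformed.
function parse_ids($value)
{
    // One or more 5-digit groups separated by commas, with optional whitespace around them.
    if (!preg_match('/\A\s*\d{5}(?:\s*,\s*\d{5})*\s*\z/', $value)) {
        return null;
    }
    // Safe to split now: every element is known to be five digits plus whitespace.
    return array_map('trim', explode(',', $value));
}

$ids = parse_ids("00100, 00247,90277"); // array of three ID strings
$bad = parse_ids("00100,0024,90277");   // null, the second item has only four digits
```

Keeping the IDs as strings preserves their leading zeros, which would be lost if they were cast to integers.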

This might work:
/^\d{5}(?:,\d{5})*$/
edit 1 noticed ridgerunner has the same answer, so disregard this.
edit 2 some notes on performance.
Failure analysis
Backtracking give back on failure:
^\d{5}(?:,\d{5})*$ gives back ,\d{5}
^(?:\d{5},)*\d{5}$ gives back \d{5},
Post Backtracking regressive topography checks:
(After backtracking give back, checks are to the right of the one that gave back)
^\d{5}(?:,\d{5})*$ checks for $
^(?:\d{5},)*\d{5}$ checks for \d{5}$
Winner: ^\d{5}(?:,\d{5})*$
NON-Backtracking regexes (using the possessive modifier +, as in *+):
^\d{5}(?:,\d{5})*+$ gives nothing back, fails immediately
^(?:\d{5},)*+\d{5}$ gives nothing back, fails immediately
Benchmarks
Using a string of 50 blocks of \d{5},.
The sample string is matched against each regex in a loop of 100,000 times.
Failure was induced at the end of the string, and removed for the success test.
Success:
All took 1 second to complete a successful run.
Failure, Backtracking:
^\d{5}(?:,\d{5})*$ took 1.2 seconds (best)
^(?:\d{5},)*\d{5}$ took 1.6 seconds
Failure, Non-Backtracking:
^\d{5}(?:,\d{5})*+$ took .9 seconds
^(?:\d{5},)*+\d{5}$ took .9 seconds
Conclusions
Backtracking - Put the smallest post-backtracking check
after the backtracking sub-expression. In this case, the
smallest is $.
In general, put the required expressions ahead of the optional ones.
Best ^\d{5}(?:,\d{5})*$
NON-Backtracking - It doesn't matter.
^\d{5}(?:,\d{5})*+$ or ^(?:\d{5},)*+\d{5}$
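To reproduce a comparison like this yourself, a rough timing harness could look like the sketch below. It is not the code used for the numbers above: the trailing comma as the induced failure, the loop count of 10,000 and the labels are all assumptions, and current PCRE builds apply their own optimizations, so the measured gaps may differ or disappear.

```php
<?php
// Build a subject of 50 five-digit blocks, as in the benchmark described above.
$good = implode(',', array_fill(0, 50, '12345'));
$bad  = $good . ',';   // assumption: failure is induced by a trailing comma

$patterns = array(
    'required first' => '/^\d{5}(?:,\d{5})*$/',
    'optional first' => '/^(?:\d{5},)*\d{5}$/',
    'possessive'     => '/^\d{5}(?:,\d{5})*+$/',
);

$results = array();
foreach ($patterns as $label => $pattern) {
    $start = microtime(true);
    for ($i = 0; $i < 10000; $i++) {
        $matched = preg_match($pattern, $bad); // time the failing case
    }
    $elapsed = microtime(true) - $start;
    // Record the match result for the good and the bad subject as a sanity check.
    $results[$label] = array(preg_match($pattern, $good), $matched);
    printf("%-15s %.4f s\n", $label, $elapsed);
}
```

Whatever the timings turn out to be, all three patterns should accept the good string and reject the bad one.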

Any faster, simpler alternative to php preg_match

I am using CakePHP 1.3 and I have a textarea where users submit articles. On submit, I want to look through the article for certain keywords and add the respective tags to the article.
I was thinking of preg_match(), but its pattern has to be a string, so I would have to loop through a (big) array.
Is there an easier way to plug the keywords array into the pattern?
I appreciate all your help.
Thanks.

I suggest treating your array of keywords like a hash table. Lowercase the article text, explode by spaces, then loop through each word of the exploded array. If the word exists in your hash table, push it to a new array while keeping track of the number of times it's been seen.
I ran a quick benchmark comparing regex to hash tables in this scenario. To run it with regex 1000 times, it took 17 seconds. To run it with a hash table 1000 times, it took 0.4 seconds. It should be an O(n+m) process.
$keywords = array("computer", "dog", "sandwich");
$article = "This is a test using your computer when your dog is being a dog";
$arr = explode(" ", strtolower($article));
$tracker = array();
foreach ($arr as $word) {
    if (in_array($word, $keywords)) {
        if (isset($tracker[$word]))
            $tracker[$word]++;
        else
            $tracker[$word] = 1;
    }
}
The $tracker array would output: "computer" => 1, "dog" => 2. You can then do the process to decide what tags to use. Or if you don't care about the number of times the keyword appears, you can skip the tracker part and add the tags as the keywords appear.
EDIT: The keyword array should really be an inverted index to ensure the fastest lookup. in_array() searches the array linearly, so the loop above isn't as fast as it could be. An inverted index array would look like
array("computer" => 1, "dog" => 1, "sandwich" => 1); // "1" can be any non-null value
Then you would do isset($keywords[$word]) to check if the word matches a keyword, instead of in_array(), which gives you an O(1) lookup on average.
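A minimal sketch of the inverted-index variant, using array_flip() to build the index from the existing keyword list (the variable name $index is just for illustration):

```php
<?php
$keywords = array('computer', 'dog', 'sandwich');
$article  = 'This is a test using your computer when your dog is being a dog';

// array_flip() turns the values into keys: array('computer' => 0, 'dog' => 1, ...).
$index = array_flip($keywords);

$tracker = array();
foreach (explode(' ', strtolower($article)) as $word) {
    if (isset($index[$word])) {          // hash lookup instead of a linear scan
        $tracker[$word] = isset($tracker[$word]) ? $tracker[$word] + 1 : 1;
    }
}
```

isset() is safe here even though the first keyword maps to 0, because isset() only returns false for missing keys and null values.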

If you don't need the power of regular expressions, you should just use strpos().
You will still need to loop through the array of words, but strpos is much, much faster than preg_match.

Of course, you could try matching all the keywords using one single regexp, like /word1|word2|word3/, but I'm not sure it is what you are looking for. And also I think it would be quite heavy and resource-consuming.
Instead, you can try a different approach, such as splitting the text into words and checking whether the words are interesting or not. I would make use of str_word_count(), using something like:
$text = 'this is my string containing some words, some of the words in this string are duplicated, some others are not.';
$words_freq = array_count_values(str_word_count($text, 1));
That splits the text into words and counts occurrences. The words are the array keys, so you can then check with isset($words_freq[$keyword]) or array_intersect(array_keys($words_freq), $my_keywords).
If, as I guess, the case of the keywords does not matter, you can strtolower() the whole text before splitting it into words.
Of course, the only way to determine which approach is the best is to setup some testing, by running various search functions against some "representative" and quite long text and measuring the execution time and resource usage (try microtime(TRUE) and memory_get_peak_usage() to benchmark this).
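As a small illustration of the word-splitting approach, the frequency table can be reduced to just the keywords with array_intersect_key(). The keyword list and sample text below are made up for the example.

```php
<?php
$text        = 'This is a test using your computer when your dog is being a dog.';
$my_keywords = array('computer', 'dog', 'sandwich'); // hypothetical keyword list

// Count every word once, then keep only the entries whose key is a keyword.
$words_freq = array_count_values(str_word_count(strtolower($text), 1));
$found      = array_intersect_key($words_freq, array_flip($my_keywords));
```

$found then maps each keyword that actually occurs in the text to its count, which can be used directly to decide on tags.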
EDIT: I cleaned up a bit the code and added a missing semi-colon :)

If you want to look for multiple words from an array, then combine said array into a regular expression:
$regex_array = implode("|", array_map(function ($word) { return preg_quote($word, "/"); }, $array));
preg_match_all("/($regex_array)/", $src, $tags);
This converts your array into /(word|word|word|word|word|...)/. The array_map() and preg_quote() part is optional, only needed if the $array might contain special characters.
Avoid strpos and loops for this case. preg_match is faster when searching for alternatives.
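A slightly fuller sketch of the single-pattern approach, with made-up keywords and text. Two additions here are assumptions beyond the answer above: the lookarounds (?<!\w) and (?!\w) keep a keyword from matching inside a longer word, and the i flag makes the match case-insensitive.

```php
<?php
$keywords = array('computer', 'dog', 'c++');   // hypothetical keywords, one with special characters
$src      = 'My dog sat on the computer while I learned C++ and fed the other dog.';

// Escape each keyword, including the "/" delimiter, and join them with "|".
$quoted = array_map(function ($word) {
    return preg_quote($word, '/');
}, $keywords);
$pattern = '/(?<!\w)(?:' . implode('|', $quoted) . ')(?!\w)/i';

preg_match_all($pattern, $src, $m);

// Normalize case and drop duplicates to get the list of tags.
$tags = array_unique(array_map('strtolower', $m[0]));
```

Lookarounds are used instead of \b because \b would not behave as expected next to a keyword that ends in a non-word character such as "c++".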

strtr()
If given two arguments, the second should be an array in the form array('from' => 'to', ...). The return value is a string where all the occurrences of the array keys have been replaced by the corresponding values. The longest keys will be tried first. Once a substring has been replaced, its new value will not be searched again.

Add tags manually? Just like we add tags here at SO.

How to split a string and find the occurrence of one string in another?

I need to figure out how to do some C# code in PHP, and I'm not sure exactly how.
First off, I need the Split function. I'm going to have a string like
"identifier 82asdjka271akshjd18ajjd"
and I need to split the identifier word from the rest. In C#, I used string.Split(new char{' '}); or something like that (working off the top of my head) and got two strings: the first word, and then the second part. I understand that the PHP split function has been deprecated as of PHP 5.3.0, so that's not an option. What are the alternatives?
I'm also looking for an IndexOf function. Taking the above string as an example again, I would need the location of 271 in the string, so I can generate a substring.

you can use explode for splitting and strpos for finding the index of one string inside another.
$a = "identifier 82asdjka271akshjd18ajjd";
$arr = explode(' ',$a); // split on space..to get an array of size 2.
$pos = strpos($arr[1],'271'); // search for '271' in the 2nd ele of array.
echo $pos; // prints 8
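To go one step further and actually produce the substring, something like the following sketch could work. The variable names are illustrative; the limit argument of 2 makes explode() split only at the first space, which guards against a second part that itself contains spaces.

```php
<?php
$a = 'identifier 82asdjka271akshjd18ajjd';

// A limit of 2 splits only at the first space: "first word" and "the rest".
list($name, $rest) = explode(' ', $a, 2);

// strpos() plays the role of IndexOf; it returns false (not -1) when nothing is found.
$pos  = strpos($rest, '271');
$tail = ($pos !== false) ? substr($rest, $pos) : '';
```

Always compare the result of strpos() with !== false, since a match at position 0 would otherwise be mistaken for "not found".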
