How to count sentences in <textarea>? - php

I have a textarea on a page with UTF-8 encoding.
How can I count all the sentences in it with PHP?
Update:
A sentence starts with a capital letter and ends with a dot, question mark or exclamation mark.

From PHP's point of view, a <textarea> is simply another <input>, so it will be available through $_GET or $_POST as normal when the form is submitted.
Sentence counting in itself is quite complicated. You could count the number of periods (.) in the text, but this would fail with abbreviations such as "e.g.". You could count the periods followed by a space and then a capital letter, but this would fail for abbreviations followed by proper nouns, and also for people who don't capitalize the start of their sentences. You could pick an average sentence length (say 70 characters) and approximate sentences = characters/70. None of these solutions is perfect (or even good, in my opinion).
UPDATE: Following your updated question, the following should be helpful:
<?php
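// preg_match_all() returns the number of full pattern matches;
// count($matches) would only count the capture groups.
$count = preg_match_all("/(^|[.!?])\s*[A-Z]/", $_POST['textarea'], $matches);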
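Since the page is UTF-8, a variant that also recognizes non-ASCII capitals may be useful. The following is only a sketch under the definition given in the update (capital letter at the start; dot, question mark or exclamation mark at the end). The function name count_sentences is hypothetical, and abbreviations such as "Dr." will still be counted as sentences.

```php
<?php
// Hypothetical helper: counts spans that start with an uppercase letter
// and run up to the next '.', '!' or '?'.
function count_sentences(string $text): int
{
    // \p{Lu} matches any Unicode uppercase letter; the u flag makes PCRE
    // treat both the pattern and the subject as UTF-8.
    $count = preg_match_all('/\p{Lu}[^.!?]*[.!?]+/u', $text);
    return $count === false ? 0 : $count;
}

echo count_sentences('Hello there. Is this a test? Yes! not counted.'); // 3
```

The last fragment is skipped because it does not start with a capital letter.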

As Nobody was saying already, it depends on how you define a sentence. Is it a dot? Is it a linebreak? Is it a capital?
I think it's really hard to define "a sentence", because for every definition you can think of 100 exceptions to that rule.
Anyway, once you come up with a definition, you can count the occurrences of that in your textarea: the number of linebreaks, the number of dots or the number of capital letters, or a combination of all of those. So basically, just take the contents of your textarea and run some function on it. :-)
That's the best that can be answered to this question imo.
Edit: After your edit, my answer is:
function starts_with_upper($str) {
    $chr = mb_substr($str, 0, 1, "UTF-8");
    // An uppercase letter differs from its lowercase form.
    return mb_strtolower($chr, "UTF-8") != $chr;
}
// Split the text on dots and count the pieces that start with a capital letter.
$total = 0;
$sentences = explode('.', rtrim($text, '.'));
for ($i = 0; $i < count($sentences); $i++) {
    // Drop the whitespace left over after the previous dot.
    $sentence = ltrim($sentences[$i]);
    if (starts_with_upper($sentence)) {
        $total++;
    }
}
echo "You have " . $total . " sentences ending in a dot.";

If you treat a sentence as a group of words with a dot at the end, you can count the dots in your text.
If your sentences are separated by new lines, count the \n's.
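A minimal sketch of both counts using substr_count, assuming the text is already in a variable (the sample $text here is made up):

```php
<?php
// Hypothetical sample text; in practice this would come from the submitted form.
$text = "First line. Second sentence.\nThird one.";

$dots  = substr_count($text, '.');  // number of '.' characters
$lines = substr_count($text, "\n"); // number of line breaks

echo "dots: $dots, line breaks: $lines\n";
```

Note that a textarea submitted from a browser usually uses "\r\n" line endings, so counting "\n" still gives the number of breaks.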

Related

Position in string of 4-digit number beginning with "20"

I've got a string:
$string = "Something here 2014 another text here";
I need to detect the position of the first four-digit number that begins with "20".
So the result for the example would be position 15 in $string (counting from zero).
Since you have commented with code you tried, I now feel comfortable answering your question properly :) Thank you for trying first!
Your attempt:
preg_match('/20\d\d/', "Something here 2014 another text here",
$matches, PREG_OFFSET_CAPTURE);
... is absolutely correct, however as you correctly pointed out, it would also match 20140 (and indeed 12014 would match too).
To fix this behaviour, you can add word boundaries, because digits count as word characters. Your regex becomes:
'/\b20\d\d\b/'
This will ensure that there are no numbers (or letters, for that matter) immediately before or after your target four-digit number :)
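Put together, the lookup might look like the sketch below. It reuses the string from the question; the variable names $year and $offset are only illustrative.

```php
<?php
$string = "Something here 2014 another text here";

// PREG_OFFSET_CAPTURE makes each match an array of [matched text, byte offset].
if (preg_match('/\b20\d\d\b/', $string, $matches, PREG_OFFSET_CAPTURE)) {
    [$year, $offset] = $matches[0];
    echo "Found $year at offset $offset\n"; // Found 2014 at offset 15
}
```

The offset is counted in bytes from zero, so it can differ from the character position if multibyte UTF-8 characters precede the match.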
What about...
$needle = "20";
$pos = strpos($string , $needle);
EDIT:
as requested, a way to get the string from this
$date = substr($string, $pos, 4);

Regular expression to match an exact number of occurrences of a certain character

I'm trying to check if a string has a certain number of occurrences of a character.
Example:
$string = '123~456~789~000';
I want to verify if this string has exactly 3 instances of the character ~.
Is that possible using regular expressions?
Yes
/^[^~]*~[^~]*~[^~]*~[^~]*$/
Explanation:
^ ... $ means the whole string in many regex dialects
[^~]* a string of zero or more non-tilde characters
~ a tilde character
The string can have as many non-tilde characters as necessary, appearing anywhere in the string, but must have exactly three tildes, no more and no less.
As a single character is technically a substring, and the task is to count the number of its occurrences, I suppose the most efficient approach lies in using a dedicated PHP function, substr_count:
$string = '123~456~789~000';
if (substr_count($string, '~') === 3) {
// string is valid
}
Obviously, this approach won't work if you need to count the number of pattern matches (for example, while you can count the number of '0' characters in your string with substr_count, you had better use preg_match_all to count digits).
Yet for this specific question it should be faster overall, as substr_count is optimized for one specific goal, counting substrings, whereas preg_match_all is more general-purpose.
I believe this should work for a variable number of characters:
^(?:[^~]*~[^~]*){3}$
The advantage here is that you just replace 3 with however many you want to check.
To make it more efficient, it can be written as
^[^~]*(?:~[^~]*){3}$
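As an illustration of the "just replace 3" idea, the pattern can be built from the character and the count. This is only a sketch: has_exact_count is a hypothetical name, and it assumes $char is a single one-byte character.

```php
<?php
// Hypothetical helper: true if $subject contains $char exactly $n times.
function has_exact_count(string $subject, string $char, int $n): bool
{
    // Escape the character so it is safe inside the pattern and the class.
    $c = preg_quote($char, '/');
    // Same shape as ^[^~]*(?:~[^~]*){3}$; the D flag stops $ from matching
    // before a trailing newline.
    $pattern = '/^[^' . $c . ']*(?:' . $c . '[^' . $c . ']*){' . $n . '}$/D';
    return preg_match($pattern, $subject) === 1;
}

var_dump(has_exact_count('123~456~789~000', '~', 3)); // bool(true)
var_dump(has_exact_count('123~456~789', '~', 3));     // bool(false)
```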
This is what you are looking for:
EDIT based on comment below:
<?php
$string = '123~456~789~000';
$total = preg_match_all('/~/', $string);
echo $total; // Shows 3

php preg_replace frustration

I'm reluctant to ask, but I can't figure out PHP's preg_replace and how to ignore certain bits of the string.
$string = '2012042410000102';
$string needs to look like _0424_102
The numbers shown are variable and always changing, and 2012 changes every year.
What I've tried:
^\d{4}[^\d{4}]10000[^\d{3}]$
^\d{4}[^\d]{4}10000[^\d]{3}$
Any help would be appreciated. I know it's a noob question but easy points for whoever helps.
Thanks
Your first regex is looking for:
The start of the string
Four digits (the year)
Any single character that is not a digit nor { or }
The number 10000
Any single character that is not a digit nor { or }
The end of the string
Your second regex is looking for:
The start of the string
Four digits (the year)
Any four characters that are not digits
The number 10000
Any three characters that are not digits
The end of the string
The regex you're looking for is:
^\d{4}(\d{4})10000(\d{3})$
And the replacement should be:
_$1_$2
This regex looks for:
The start of the string
Four digits (the year)
Capture four digits (the month and day)
The number 10000
Capture three digits (the 102 at the end in your example)
The end of the string
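For reference, here is how that pattern and replacement could be passed to preg_replace. The replacement is written with the ${1} form, which is equivalent to $1 here and keeps the group number unambiguous when followed by other characters.

```php
<?php
$string = '2012042410000102';

// Group 1 captures the month and day, group 2 the trailing three digits.
$result = preg_replace('/^\d{4}(\d{4})10000(\d{3})$/', '_${1}_${2}', $string);

echo $result; // _0424_102
```

If the input does not match the pattern, preg_replace returns it unchanged.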
Try the following:
^\d{4}|10000(?=\d{3}$)
This will match either the first four digits in a string, or the string '10000' if there are three digits after '10000' before the end of the string.
You would use it like this:
preg_replace('/^\d{4}|10000(?=\d{3}$)/', '_', $string);
http://codepad.org/itTgEGo4
Just use simple string functions:
$string = '2012042410000102';
$new = '_'.str_replace('10000', '_', substr($string, 4));
http://codepad.org/elRSlCIP
If they're always in the same character locations, regular expressions seem unnecessary. You could use substrings to get the parts you want, like
sprintf('_%s_%s', substr($string,4,4), substr($string,13))
or
'_' . substr($string,4,4) . '_' . substr($string,13)

Read corresponding value in PHP and add to running sum

I would like to have each word in a string cross-referenced in a file.
So, if I was given the string: Jumping jacks wake me up in the morning.
I use some regex to strip out the period. Also, the entire string is made lowercase.
I then go on to have the words separated into an array by using PHP's nifty explode() function.
Now, what I'm left with, is an array with the words used in the string.
From there I need to look up each value in the array and get a value for it and add it to a running sum. for() loop it is. Okay, this is where I get stuck...
The list ($wordlist) is structured like so:
wake#4 waking#3 0.125
morning#2 -0.125
There are \ts in between the word and the number. There can be more than one word per value.
What I need the PHP to do now is look up the number to each word in the array then pull that corresponding number back to add it to a running sum. What's the best way for me to go about this?
The answer should be easy enough, just finding the location of the string in the wordlist and then finding the tab and from there reading the int... I just need some guidance.
Thanks in advance.
EDIT: to clarify -- I don't want the sum of the values of the wordlist, rather, I'd like to look up my individual values as they correspond to the words in the sentence and THEN look them up in the list and add just those values; not all of them.
Edited answer based on your comment and question edit. The running sum is stored in an array called $sum, where the key is the word and the value is its running sum, e.g. $sum['wake'] will store the running sum for the word wake and so on.
$sum = array();
$word_values = array(); // word => value taken from the wordlist
foreach ($wordlist as $line) // Loop through each line of the wordlist
{
    // A line holds one or more "word#digit" entries followed by the value,
    // all separated by whitespace (tabs), so split on whitespace first.
    $parts = preg_split('/\s+/', trim($line));
    // The last field is the number value shared by every word on the line.
    $value = (float) array_pop($parts);
    foreach ($parts as $entry)
    {
        // The word is matched up to '#'; the parenthesis (\w+) captures it.
        if (preg_match('/^(\w+)#\d+$/', $entry, $match)) $word_values[$match[1]] = $value;
    }
}
// Assuming $sentence_array stores the array of words used in your string example {"jumping", "jacks", "wake", "me", "up", "in", "the", "morning"}
foreach ($sentence_array as $word)
{
    if (!isset($word_values[$word])) continue; // word is not in the wordlist
    if (!array_key_exists($word, $sum)) $sum[$word] = 0;
    $sum[$word] += $word_values[$word];
}
I am assuming you will take care of case sensitivity, since you mentioned that you make the entire string lowercase, so I am not including that here.
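If a single running total is wanted rather than a per-word array, the same lookup can feed one accumulator. The sketch below is self-contained and makes some assumptions: the two wordlist lines are the ones from the question with tabs restored, $wordlist is an array of lines, and words missing from the list contribute 0.

```php
<?php
// Hypothetical sample data mirroring the wordlist format in the question.
$wordlist = ["wake#4\twaking#3\t0.125", "morning#2\t-0.125"];
$sentence = 'Jumping jacks wake me up in the morning.';

// Build a word => value lookup table from the wordlist lines.
$values = [];
foreach ($wordlist as $line) {
    $fields = preg_split('/\s+/', trim($line));
    $value = (float) array_pop($fields);        // last field is the value
    foreach ($fields as $field) {
        $values[strtok($field, '#')] = $value;  // "wake#4" gives "wake"
    }
}

// Lowercase the sentence, split on anything that is not a letter, and sum.
$total = 0.0;
$words = preg_split('/[^a-z]+/', strtolower($sentence), -1, PREG_SPLIT_NO_EMPTY);
foreach ($words as $word) {
    $total += $values[$word] ?? 0.0;
}

echo $total; // 0.125 (wake) + -0.125 (morning) = 0
```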
$sentence = 'Jumping jacks wake me up in the morning';
$words = array();
foreach (explode(' ', $sentence) as $w) {
    if (array_key_exists($w, $words)) {
        $words[$w]++;      // seen before: increment its count
    } else {
        $words[$w] = 1;    // first occurrence
    }
}
Explode by space and check whether the word is already a key in the $words array; if so, increment its count (value); if not, set its value to 1. Loop this for each of your sentences without redeclaring $words = array().

A PHP Library / Class to Count Words in Various Languages?

Some time in the near future I will need to implement a cross-language word count, or if that is not possible, a cross-language character count.
By word count I mean an accurate count of the words contained within the given text, taking the language of the text into account. The language of the text is set by a user, and will be assumed to be correct.
By character count I mean a count of the "possibly in a word" characters contained within the given text, with the same language information described above.
I would much prefer the former count, but I am aware of the difficulties involved. I am also aware that the latter count is much easier, but very much prefer the former, if at all possible.
I'd love it if I just had to look at English, but I need to consider every language here, Chinese, Korean, English, Arabic, Hindi, and so on.
I would like to know if Stack Overflow has any leads on where to start looking for an existing product / method to do this in PHP, as I am a good lazy programmer*
A simple test showing how str_word_count with setlocale doesn't work, and a function from php.net's str_word_count page.
*http://blogoscoped.com/archive/2005-08-24-n14.html
Counting chars is easy:
echo strlen('一个有十的字符的句子'); // 30 (WRONG!)
echo strlen(utf8_decode('一个有十的字符的句子')); // 10
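Note that utf8_decode is deprecated as of PHP 8.2. A sketch of the same count with mb_strlen, assuming the mbstring extension is available, could look like this:

```php
<?php
$text = '一个有十的字符的句子';

$bytes = strlen($text);             // counts bytes: 30
$chars = mb_strlen($text, 'UTF-8'); // counts characters: 10

echo "$bytes bytes, $chars characters\n";
```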
Counting words is where things start to get tricky, especially for Chinese, Japanese and other languages that don't use spaces (or other common "word boundary" characters) as word separators. I don't speak Chinese and I don't understand how word counting works in Chinese, so you'll have to educate me a bit: what makes a word in these languages? Is it any specific char or set of chars? I remember reading something related to how hard it was to identify Japanese words in T9 writing but can't find it anymore.
The following should correctly return the number of words in languages that use spaces or punctuation chars as word separators:
count(preg_split('~[\p{Z}\p{P}]+~u', $string, -1, PREG_SPLIT_NO_EMPTY));
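Wrapped in a function, the one-liner might be used as sketched below. The name count_words_unicode is hypothetical, and \s has been added to the class as an assumption, because \p{Z} does not cover tabs or newlines.

```php
<?php
// Hypothetical wrapper around the preg_split approach.
function count_words_unicode(string $text): int
{
    // Split on runs of Unicode separators, punctuation or whitespace.
    $words = preg_split('~[\p{Z}\p{P}\s]+~u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? 0 : count($words);
}

echo count_words_unicode('Hello, world! How are you?'); // 5
```

Apostrophes count as punctuation here, so a contraction such as "don't" is split into two words.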
A quick trick, if you only want an approximate rather than exact word count, is
<?php echo count(explode(' ',$string)); ?>
It works by splitting on spaces, whatever the language. I have used this for a translator script. Again, it will not count the exact words but gives an approximate word count for a paragraph.
Well, try:
<?php
function count_words($str){
    $words = 0;
    // eregi_replace() and eregi() were removed in PHP 7; preg_* replaces them.
    $str = preg_replace('/ +/', ' ', $str); // collapse runs of spaces
    $array = explode(' ', $str);
    for ($i = 0; $i < count($array); $i++)
    {
        // Count the piece only if it contains a letter or digit.
        if (preg_match('/[0-9A-Za-zÀ-ÖØ-öø-ÿ]/u', $array[$i]))
            $words++;
    }
    return $words;
}
echo count_words('This is the second one , it will count wrong as well" , it will count 12 instead of 11 because the comma is counted too.');
?>
