I have a task to count sentences without using str_word_count. My senior gave it to me, but I am not able to understand it. Can someone explain it?
I need to understand each variable and how it works.
<?php
$sentences = "this book are bigger than encyclopedia";
function countSentences($sentences) {
    $y = "";
    $numberOfSentences = 0;
    $index = 0;
    while($sentences != $y) {
        $y .= $sentences[$index];
        if ($sentences[$index] == " ") {
            $numberOfSentences++;
        }
        $index++;
    }
    $numberOfSentences++;
    return $numberOfSentences;
}
echo countSentences($sentences);
?>
The output is
6
It's something very trivial, I'd say.
The task is to count the words in a sentence. A sentence is a string (a sequence of characters) made up of letters and white space (space, new line, etc.).
Now, what's a word of the sentence? It is a distinct group of letters that doesn't "touch" any other group of letters; that is, words are separated from each other by white space (let's say just a normal blank space).
So the simplest algorithm to count words consists of:
- $words_count_variable = 0
- go through all the characters, one-by-one
- each time you find a space, it means a new word just ended before that, and you have to increase your $words_count_variable
- lastly, you'll find the end of the string, and that means a word just ended before that, so you'll increase for the last time your $words_count_variable
Take "this is a sentence".
We set $words_count_variable = 0;
Your while cycle will analyze:
"t"
"h"
"i"
"s"
" " -> blank space: a word just ended -> $words_count_variable++ (becomes 1)
"i"
"s"
" " -> blank space: a word just ended -> $words_count_variable++ (becomes 2)
"a"
" " -> blank space: a word just ended -> $words_count_variable++ (becomes 3)
"s"
"e"
"n"
...
"n"
"c"
"e"
-> end reached: a word just ended -> $words_count_variable++ (becomes 4)
So, 4.
4 words counted.
Hope this was helpful.
Basically, it just counts the number of spaces in the sentence and adds one for the last word.
<?php
$sentences = "this book are bigger than encyclopedia";
function countSentences($sentences) {
    $y = ""; // Copy of $sentences rebuilt char by char; the loop stops once it equals $sentences
    $numberOfSentences = 0; // Counter of words
    $index = 0; // Array index used for $sentences
    // Reach all chars from $sentences (char by char)
    while($sentences != $y) {
        $y .= $sentences[$index]; // Append the current char to $y
        // If the current char is a space, increase the word counter
        if ($sentences[$index] == " "){
            $numberOfSentences++;
        }
        $index++; // Move the index to the next char for the next loop round
    }
    $numberOfSentences++; // Additional increment to count the last word
    return $numberOfSentences;
}
echo countSentences($sentences);
?>
Be aware that this function gives wrong results in several cases. For example, if two spaces follow each other, each extra space is counted as an extra word.
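One way to avoid the double-space problem is to count the places where a word starts instead of counting spaces. The following is only a minimal sketch of that idea; the function name countWords and the use of ctype_space() are my own choices, not part of the original task.

```php
<?php
// Hypothetical helper: counts words by detecting the start of each word
// instead of counting spaces.
function countWords(string $sentence): int
{
    $count = 0;
    $inWord = false; // true while the previous character belonged to a word
    $length = strlen($sentence);

    for ($i = 0; $i < $length; $i++) {
        $isSpace = ctype_space($sentence[$i]); // space, tab, new line, ...
        if (!$isSpace && !$inWord) {
            $count++; // a non-space char right after white space starts a new word
        }
        $inWord = !$isSpace;
    }

    return $count;
}

// Two spaces after "this" no longer produce an extra word.
echo countWords("this  book are bigger than encyclopedia");
```

Because the counter only increases on a transition from white space to a non-space character, leading, trailing, and repeated spaces do not change the result, and an empty string gives 0.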
Related
I am trying to get a count of common phrases from a body of text. I don't just want single words, but rather all series of words between any stop words. So for example, for https://en.wikipedia.org/wiki/Wuthering_Heights I would like the phrase "wuthering heights" to be counted rather than "wuthering" and "heights".
if (in_array($word, $this->stopwords))
{
    $cleanPhrase = preg_replace("/[^A-Za-z ]/", '', $currentPhrase);
    $cleanPhrase = trim($cleanPhrase);
    if($cleanPhrase != "" && strlen($cleanPhrase) > 2)
    {
        $this->Phrases[$cleanPhrase] = substr_count($normalisedText, $cleanPhrase);
        $currentPhrase = "";
    }
    continue;
}
else
    $currentPhrase = $currentPhrase . $word . " ";
The problem I have with this is that "age" is counted whenever the word "stage" is used. One solution is to add white space to either side of the $cleanPhrase variable, but that fails when the phrase is not surrounded by white space: there could be a comma, full stop, or some other punctuation character instead. I want to count all of these. Is there a way to do this without having to do something like the following?
$terminate = array(".", " ", ",", "!", "?");
$count = 0;
foreach($terminate as $tpun)
{
    $count += substr_count($normalisedText, $tpun . $cleanPhrase . $tpun);
}
By utilizing this answer with slight modification, you can do this:
$sentence = "Age: In this day and age, people of all age are on the stage.";
$word = 'age';
preg_match_all('/\b'.$word.'\b/i', $sentence, $matches);
\b represents a word boundary. So that string will give a count of 3 if searching for age (the i flag in the pattern means case insensitive, you can remove it if you want to match case as well).
If you're only going to match on one phrase at a time, you'll find your count in count($matches[0]).
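To tie this back to the phrase counter in the question, the pattern can be wrapped in a small function. This is just a sketch: the name countPhrase is hypothetical, and I have added preg_quote() on the assumption that a phrase might contain characters that are special in a regular expression.

```php
<?php
// Hypothetical helper: counts whole-word occurrences of $phrase in $text,
// ignoring case.
function countPhrase(string $text, string $phrase): int
{
    // preg_quote() escapes regex metacharacters in the phrase;
    // its second argument also escapes the "/" delimiter.
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';

    // preg_match_all() returns the number of full matches (or false on error).
    return (int) preg_match_all($pattern, $text);
}

$sentence = "Age: In this day and age, people of all age are on the stage.";
echo countPhrase($sentence, 'age');
```

Since preg_match_all() already returns the number of matches, the $matches array is not needed when only the count matters.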
I want to take a post description but only display the first, for example, 30 letters but ignore any tabs and spaces.
$msg = 'I only need the first, let us just say, 30 characters; for the time being.';
$msg .= ' Now I need to remove the spaces out of the checking.';
$amount = 30;
// if tabs or spaces exist, alter the amount
if(preg_match("/\s/", $msg)) {
    $stripped_amount = strlen(str_replace(' ', '', $msg));
    $amount = $amount + (strlen($msg) - $stripped_amount);
}
echo substr($msg, 0, $amount);
echo '<br /> <br />';
echo substr(str_replace(' ', '', $msg), 0, 30);
The first output gives me 'I only need the first, let us just say, 30 characters;' and the second output gives me: Ionlyneedthefirst,letusjustsay so I know this isn't working as expected.
My desired output in this case would be:
I only need the first, let us just say
Thanks in advance, my maths sucks.
You could get the part with the first 30 characters with a regular expression:
$msg_short = preg_replace('/^((\s*\S\s*){0,30}).*/s', '$1', $msg);
With the given $msg value, you will get in $msg_short:
I only need the first, let us just say
Explanation of the regular expression
^: match must start at the beginning of the string
\s*\S\s* a non-white-space (\S) surrounded by zero or more white-space characters (\s*)
(\s*\S\s*){0,30} repeat finding this sequence up to 30 times (greedy; get as many as possible within that limit)
((\s*\S\s*){0,30}) the parentheses make this series of characters group number 1, which can be referenced as $1
.* any other characters. This will match all remaining characters, because of the s modifier at the end:
s: makes the dot match new line characters as well
In the replacement only the characters are maintained that belong to group one ($1). All the rest is ignored and not included in the returned string.
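If the limit should be configurable, the same expression can be built from a parameter. The sketch below assumes single-byte text (no u modifier is used) and the function name firstNonSpaceChars is made up for illustration.

```php
<?php
// Hypothetical helper: keeps the start of $text up to and including the
// first $limit non-white-space characters.
function firstNonSpaceChars(string $text, int $limit): string
{
    // Same pattern as above, with the repetition count taken from $limit.
    $pattern = '/^((\s*\S\s*){0,' . $limit . '}).*/s';

    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '$1', $text) ?? $text;
}

$msg = 'I only need the first, let us just say, 30 characters; for the time being.';
$msg .= ' Now I need to remove the spaces out of the checking.';
echo firstNonSpaceChars($msg, 30);
```

Note that white space directly after the last counted character is kept by the trailing \s*, so you may want to rtrim() the result.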
Offhand, I can think of two ways to achieve that.
The first one is close to what you did already. Take the first 30 characters, count the spaces and take as many next characters as you found spaces until the new set of letters has no spaces in it anymore.
$msg = 'I only need the first, let us just say, 30 characters; for the time being.';
$msg .= ' Now I need to remove the spaces out of the checking.';
$amount = 30;
$offset = 0;
$final_string = '';
while ($amount > 0) {
    $tmp_string = substr($msg, $offset, $amount);
    if ($tmp_string === '' || $tmp_string === false) {
        break; // end of $msg reached before $amount non-space characters were found
    }
    $amount -= strlen(str_replace(' ', '', $tmp_string));
    $offset += strlen($tmp_string);
    $final_string .= $tmp_string;
}
print $final_string;
The second technique would be to explode your string at spaces and put them back together one by one until you hit your threshold (where you would eventually need to break down a single word into characters).
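A rough sketch of that second technique could look like this. It assumes words are separated by plain spaces only (tabs are not handled), and the function name firstCharsByWords is hypothetical.

```php
<?php
// Hypothetical helper: rebuilds $text word by word until $limit
// non-space characters have been used.
function firstCharsByWords(string $text, int $limit): string
{
    $result = '';
    $remaining = $limit; // non-space characters we may still add

    foreach (explode(' ', $text) as $i => $word) {
        if ($remaining <= 0) {
            break; // threshold reached
        }
        // Cut the word if it does not fit completely.
        $piece = (string) substr($word, 0, $remaining);
        // Put back the space that explode() removed (not before the first word).
        $result .= ($i > 0 ? ' ' : '') . $piece;
        $remaining -= strlen($piece);
    }

    return $result;
}

$msg = 'I only need the first, let us just say, 30 characters; for the time being.';
echo firstCharsByWords($msg, 30);
```

Empty pieces produced by consecutive spaces are simply re-joined with a space, so the original spacing is preserved up to the cut.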
Try this and see if it works:
<?php
$string= 'I only need the first, let us just say, 30 characters; for the time being.';
echo "Everything: ".strlen($string);
echo '<br />';
echo "Only alphabetical: ".strlen(preg_replace('/[^a-zA-Z]/', '', $string));
?>
It can be done this way.
$tmp = str_split($string); // split the string into single characters
$result = "";
$i = 0; $j = 0;
while (isset($tmp[$i]) && $j < 30) {
    if (trim($tmp[$i]) !== '') { // count non-space characters (a plain truthiness test would skip "0")
        $j++;
    }
    $result .= $tmp[$i++];
}
print $result;
I don't know regex too well so...
<?php
$msg = 'I only need the first, let us just say, 30 characters; for the time being. Now I need to remove the spaces out of the checking.';
$non_space_hit = 0;
for($i = 0; $i < strlen($msg); ++$i)
{
    echo $msg[$i];
    $non_space_hit+= (int)($msg[$i] !== ' ' && $msg[$i] !== "\t");
    if($non_space_hit === 30)
    {
        break;
    }
}
You end up with:
I only need the first, let us just say
Let's say I have a huge string from which I want to extract a certain value belonging to a name, for example the stock price of Apple.
Let's say the string looks like this (in reality it's HTML, but that does not matter here):
$output = "nsdfsdnfsnfdnsdfnueruherdfndsdndnjsdnasdnn Apple dndfjnfjdf647tgtgtgeq";
I want to extract the value 647.
The real string is maybe some hundred thousand characters.
I can reveal the position of Apple by:
$str = "Apple";
$pos = strpos($output, $str);
Let's say the function returns 87310, which is the index position of the first letter in Apple.
Here comes my question: is there an easy way to extract the value when I know the start position of Apple? I have looked for such a function but cannot find one right now.
I could solve this easily by just looping ahead from the name Apple and then extracting the relevant characters. But it would at least save keystrokes to use a function for this instead.
Thanks!!!
To just pull out the stock price, you would want to do something like this:
Search your string for "Apple" and save $position + 5 (length of Apple). Search directly after $position, one character at a time, for the first character that is_numeric and add that to a string, $stock_val. Continue adding all subsequent characters until you find one that !is_numeric. Here is my clunky code:
// $str is the text being searched (the question's $output)
$position = strpos(strtolower($str), "apple") + strlen("apple");
$temp_str = substr($str, $position);
$stock_val = "";
do {
    $char = substr($temp_str, 0, 1); // Take first char of $temp_str
    $temp_str = substr($temp_str, 1); // Remove that char from $temp_str
    $is_acceptable = (is_numeric($char) || $char == "." || $char == ",");
    if ($is_acceptable) { // If the char is numeric (or a separator), add it to $stock_val
        $stock_val .= $char;
    }
    if (!$is_acceptable && $stock_val != "") {
        break; // If the char is NOT numeric AND $stock_val already has characters, stop
    }
} while (strlen($temp_str) > 0); // Repeat while there are still characters
You know the start position, so determine the end position, then use substr to cut away the unwanted parts of the string.
Something like this, using substr:
$portion = substr($string, $start, $end - $start);
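As an alternative to looping by hand, preg_match() accepts an offset as its fifth argument, so the search for digits can start right after the name. This is only a sketch: the function name extractNumberAfter is hypothetical, and it assumes the value is the first run of digits (optionally with "." or "," separators) that follows the name.

```php
<?php
// Hypothetical helper: returns the first number found after $name in
// $haystack, or null if the name or a number is missing.
function extractNumberAfter(string $haystack, string $name): ?string
{
    $pos = strpos($haystack, $name);
    if ($pos === false) {
        return null; // name not found
    }

    // Start matching after the name; allow "." or "," between digit groups.
    $offset = $pos + strlen($name);
    if (preg_match('/\d+(?:[.,]\d+)*/', $haystack, $matches, 0, $offset) === 1) {
        return $matches[0];
    }

    return null; // no digits after the name
}

$output = "nsdfsdnfsnfdnsdfnueruherdfndsdndnjsdnasdnn Apple dndfjnfjdf647tgtgtgeq";
echo extractNumberAfter($output, "Apple");
```

The result is returned as a string so that separators such as "1,234.50" are kept exactly as they appear in the text.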
Objective: Search through an array of tens of thousands of Chinese sentences in order to locate sentences that exclusively contain characters from a "known characters" array.
For example:
Let's say my corpus consists of the following sentences: 1) 我去中国。 2) 妳爱他。 3) 你在哪里?
I only "know" or want sentences that exclusively contain these characters: 1) 我 2) 中 3) 国 4) 你 5) 在 6) 去 7) 爱 8) 哪 9) 里.
The first sentence would be returned as result because all three of its characters are in my second array. The second sentence would be rejected because I did not ask for 妳 or 他. The third sentence would be returned as a result. The punctuation marks are ignored (as well as any alpha-numeric characters).
I have a working script that does this (below). I'm wondering if this is an efficient way or not. If you are interested, please take a look and suggest changes, write your own, or give some advice. I've gleaned some from this script and checked out some stackoverflow questions, but they didn't address this scenario.
<?php
$known_characters = parse_file("FILENAME"); // retrieves target characters
$sentences = parse_csv("FILENAME"); // retrieves the text corpus
$number_wanted = 30; // number of sentences to attempt to retrieve
$found = array(); // stores results
$number_found = 0; // number of results
$character_known = false; // assume character is not known
$sentence_known = true; // assume sentence matches target characters
foreach ($sentences as $s) {
    // retrieves an array of the sentence
    $sentence_characters = mb_str_split($s->ttext);
    foreach ($sentence_characters as $sc) {
        // check to see if the character is alpha-numeric or punctuation
        // if so, then ignore.
        $pattern = '/[a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}]/u';
        if (!preg_match($pattern, $sc)) {
            foreach ($known_characters as $kc) {
                if ($sc==$kc) {
                    // if character is known, move to next character
                    $character_known = true;
                    break;
                }
            }
        } else {
            // character is known if it is alpha-numeric or punctuation
            $character_known = true;
        }
        if (!$character_known) {
            // if character is unknown, move to next sentence
            $sentence_known = false;
            break;
        }
        $character_known = false; // reset for next iteration
    }
    if ($sentence_known) {
        // if sentence is known, add it to results array
        $found[] = $s->ttext;
        $number_found = $number_found+1;
    }
    if ($number_found==$number_wanted)
        break; // if required number of results are found, break
    $sentence_known = true; // reset for next iteration
}
?>
It appears to me this should do it:
$pattern = '/[^a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}我中国你在去爱哪里]/u';
if (preg_match($pattern, $sentence)) {
    // the sentence contains characters besides a-zA-Z0-9, punctuation
    // and the selected characters
} else {
    // the sentence contains only the allowed characters
}
Make sure to save your source code file in UTF-8.
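Since the known characters come from a file, the character class probably has to be built at run time rather than typed into the pattern. Here is a minimal sketch of how that might look; the function names are hypothetical, the sentences are assumed to be plain UTF-8 strings, and the ignored ranges are copied from the pattern above (so ASCII punctuation is not ignored).

```php
<?php
// Hypothetical helper: builds a pattern that matches any character that is
// neither ignored (alpha-numeric, white space, CJK/full-width punctuation)
// nor in the list of known characters.
function buildUnknownCharPattern(array $knownCharacters): string
{
    // preg_quote() guards against characters that are special inside a regex.
    $known = preg_quote(implode('', $knownCharacters), '/');
    return '/[^a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}' . $known . ']/u';
}

// Hypothetical helper: returns up to $limit sentences made only of allowed characters.
function filterKnownSentences(array $sentences, array $knownCharacters, int $limit): array
{
    $pattern = buildUnknownCharPattern($knownCharacters);
    $found = [];

    foreach ($sentences as $sentence) {
        // preg_match() returns 0 when no unknown character is present.
        if (preg_match($pattern, $sentence) === 0) {
            $found[] = $sentence;
            if (count($found) === $limit) {
                break; // required number of results reached
            }
        }
    }

    return $found;
}

$known = ['我', '中', '国', '你', '在', '去', '爱', '哪', '里'];
$corpus = ['我去中国。', '妳爱他。', '你在哪里?'];
print_r(filterKnownSentences($corpus, $known, 30));
```

This runs one regular expression per sentence instead of three nested loops, which should scale better to tens of thousands of sentences, though I have not benchmarked it.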
I have a query that I'm trying to run against a database of words. The query must return a word that contains a given word plus letters from a given set. It is in PHP and MySQL.
For example:
Word Given: Cruel
Letters Given: abcdty
In the database, I need to find the word "Cruelty" based on the letters given and the word given. It needs to work both ways. So if I had "atni" for letters, "Anticruel" would appear if it existed in the database.
I have it half working but the result given is not correct:
SELECT word
FROM words
WHERE LOCATE( "cruel", word ) >0
AND word != "cruel"
AND word
REGEXP '[ybilteh]'
The result set from this query:
"anticruelty"
"crueler"
"cruelest"
"crueller"
"cruellest"
"cruelly"
"cruelness"
"cruelnesses"
"cruelties"
"cruelty"
Update!!!
Thanks to Benjamin Morel, this is getting much closer.
This query:
SELECT word
FROM words
WHERE LOCATE( "t", word ) >0
AND word != "t"
AND word
REGEXP '^[ybilteh]*t[ybilteh]*$'
LIMIT 0 , 30
This finds words correctly, but it also includes words with double letters, such as "Beet", when only one "e" is available.
Try this one:
SELECT word
FROM words
WHERE word REGEXP '^[ybilteh]*cruel[ybilteh]*$'
AND word != 'cruel';
UPDATE: let's go refining with PHP, what about this?
$word = 'cruel';
$letters = 'ybilteh';
$items = array("anticruelty", "crueler", "cruelest",
    "crueller", "cruellest", "cruelly", "cruelness",
    "cruelnesses", "cruelties", "cruelty");
$letters = str_split($letters);
foreach ($items as $item) {
    $list = $letters;
    // remove the original word (once)
    $thisItem = preg_replace("/$word/", '', $item, 1);
    for ($i=0; $i<strlen($thisItem); $i++) {
        $index = array_search($thisItem[$i], $list);
        if ($index === false) {
            continue 2; // letter not available
        }
        unset($list[$index]); // remove the letter from the list
    }
    echo "$item\n"; // passed!
}
Returns: cruelly, cruelty
You can probably find a better/simpler approach, but that should do the trick!
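One possible alternative is to compare letter counts instead of removing letters one by one. The sketch below uses count_chars() for that; the function name fitsLetters is hypothetical, and it assumes lowercase single-byte words, as in the sample data.

```php
<?php
// Hypothetical helper: true if $candidate consists of $word plus extra
// letters that can all be taken from $letters (each letter used at most
// as often as it occurs in $letters).
function fitsLetters(string $candidate, string $word, string $letters): bool
{
    $pos = strpos($candidate, $word);
    if ($pos === false || $candidate === $word) {
        return false; // must contain the word and be longer than it
    }

    // Remove the first occurrence of the word; what is left must be covered by $letters.
    $rest = substr_replace($candidate, '', $pos, strlen($word));

    // count_chars(..., 1) maps each byte value to how often it occurs.
    $available = count_chars($letters, 1);
    foreach (count_chars($rest, 1) as $byte => $needed) {
        if (($available[$byte] ?? 0) < $needed) {
            return false; // letter missing or used too often
        }
    }

    return true;
}

$items = ["anticruelty", "crueler", "cruelest", "crueller", "cruellest",
    "cruelly", "cruelness", "cruelnesses", "cruelties", "cruelty"];
foreach ($items as $item) {
    if (fitsLetters($item, 'cruel', 'ybilteh')) {
        echo "$item\n";
    }
}
```

The SQL query can still be used to narrow down the candidates first, with this check applied to the returned rows.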