PHP count word frequency with support for punctuation marks - php

I am trying to get a count of common phrases from a body of text. I don't just want single words, but rather every series of words that sits between two stop words. So for example, for https://en.wikipedia.org/wiki/Wuthering_Heights I would like the phrase "wuthering heights" to be counted rather than "wuthering" and "heights".
if (in_array($word, $this->stopwords))
{
    $cleanPhrase = preg_replace("/[^A-Za-z ]/", '', $currentPhrase);
    $cleanPhrase = trim($cleanPhrase);
    if($cleanPhrase != "" && strlen($cleanPhrase) > 2)
    {
        $this->Phrases[$cleanPhrase] = substr_count($normalisedText, $cleanPhrase);
        $currentPhrase = "";
    }
    continue;
}
else
    $currentPhrase = $currentPhrase . $word . " ";
The problem I have with this is that "age" is counted whenever the word "stage" is used. One solution is to add whitespace to either side of the $cleanPhrase variable, but that breaks when the phrase is not surrounded by whitespace: there could be a comma, a full stop or some other punctuation character next to it. I want to count all of these. Is there a way to do this without having to do something like the following?
$terminate = array(".", " ", ",", "!", "?");
$count = 0;
foreach($terminate as $tpun)
{
$count += substr_count($normalisedText, $tpun . $cleanPhrase . $tpun);
}

By utilizing this answer with slight modification, you can do this:
$sentence = "Age: In this day and age, people of all age are on the stage.";
$word = 'age';
preg_match_all('/\b'.$word.'\b/i', $sentence, $matches);
\b represents a word boundary. So that string will give a count of 3 if searching for age (the i flag in the pattern means case insensitive, you can remove it if you want to match case as well).
If you're only going to match on one phrase at a time, you'll find your count in count($matches[0]).
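As a rough sketch of how the word-boundary idea could replace substr_count in the question's loop, the helper below counts one phrase at a time. The function name count_phrase is made up for illustration, and preg_quote is added on the assumption that a phrase might contain characters that are special in a regex.

```php
<?php
// Hypothetical helper (not from the original answer): counts whole-word,
// case-insensitive occurrences of $phrase in $text.
function count_phrase(string $text, string $phrase): int
{
    // preg_quote escapes regex metacharacters and the "/" delimiter.
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
    // preg_match_all returns the number of full matches (or false on error).
    return (int) preg_match_all($pattern, $text);
}

$sentence = "Age: In this day and age, people of all age are on the stage.";
echo count_phrase($sentence, 'age'), "\n";   // "stage" is not counted
echo count_phrase($sentence, 'stage'), "\n";
```

In the question's code this would stand in for substr_count($normalisedText, $cleanPhrase).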

Related

PHP: How to extract a substring from a specified index until the next whitespace or end of line

I have an input string:
$subject = "This punctuation! And this one. Does n't space that one.";
I also have an array containing exceptions to the replacement I wish to perform, currently with one member:
$exceptions = array(
0 => "n't"
);
The reason for the complicated solution I would like to achieve is because this array will be extended in future and could potentially include hundreds of members.
I would like to insert whitespace at word boundaries (duplicate whitespace will be removed later). Certain boundaries should be ignored, though. For example, the exclamation mark and full stops in the above sentence should be surrounded with whitespace, but the apostrophe should not. Once duplicate whitespaces are removed from the final result with trim(preg_replace('/\s+/', ' ', $subject));, it should look like this:
"This punctuation ! And this one . Does n't space that one ."
I am working on a solution as follows:
Use preg_match('\b', $subject, $offsets, 'PREG_OFFSET_CAPTURE'); to gather an array of indexes where whitespace may be inserted.
Iterate over the $offsets array.
split $subject from whitespace before the current offset until the next whitespace or end of line.
check if result of split is contained within $exceptions array.
if result of split is not contained within exceptions array, insert whitespace character at current offset.
So far I have the following code:
$subject="This punctuation! And this one. Does n't space that one.";
$pattern = '/\b/';
preg_match($pattern, $subject, $offsets, PREG_OFFSET_CAPTURE );
if(COUNT($offsets)) {
$indexes = array();
for($i=0;$i<COUNT($offsets);$i++) {
$offsets[$i];
$substring = '?';
// Replace $substring with substring from after whitespace prior to $offsets[$i] until next whitespace...
if(!array_search($substring, $exceptions)) {
$indexes[] = $offsets[$i];
}
}
// Insert whitespace character at each offset stored in $indexes...
}
I can't find an appropriate way to create the $substring variable in order to complete the above example.
$res = preg_replace("/(?:n't|ALL EXCEPTIONS PIPE SEPARATED)(*SKIP)(*F)|(?!^)(?<!\h)\b(?!\h)/", " ", $subject);
echo $res;
Output:
This punctuation ! And this one . Doesn't space that one .
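Since the exceptions list is expected to grow, the "ALL EXCEPTIONS PIPE SEPARATED" placeholder could be built from the array. This is only a sketch: the function name insert_boundary_spaces is invented here, and it assumes every exception is a plain string that should be matched literally.

```php
<?php
// Hypothetical wrapper around the regex above: builds the alternation
// from the $exceptions array instead of typing it by hand.
function insert_boundary_spaces(string $subject, array $exceptions): string
{
    $boundary = '(?!^)(?<!\h)\b(?!\h)';
    if ($exceptions === []) {
        // No exceptions: an empty alternation would skip every position.
        return preg_replace('/' . $boundary . '/', ' ', $subject);
    }
    // Escape each exception so it is matched literally.
    $quoted = array_map(fn($e) => preg_quote($e, '/'), $exceptions);
    $pattern = '/(?:' . implode('|', $quoted) . ')(*SKIP)(*F)|' . $boundary . '/';
    return preg_replace($pattern, ' ', $subject);
}

echo insert_boundary_spaces(
    "This punctuation! And this one. Doesn't space that one.",
    ["n't"]
), "\n";
```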
One "easy" (but not necessarily fast, depending on how many exceptions you have) solution would be to first replace all the exceptions in the string with something unique that doesn't contain any punctuation, then perform your replacements, then convert back the unique replacement strings into their original versions.
Here's an example using md5 (but could be lots of other things):
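$subject = "This punctuation! And this one. Doesn't space that one.";
$exceptions = ["n't"];
// Start from the subject and keep replacing in $result, so that every
// exception is protected and not only the last one in the array.
$result = $subject;
foreach ($exceptions as $exception) {
    $result = str_replace($exception, md5($exception), $result);
}
// Put a space in front of every character that is not a letter, digit or whitespace.
$result = preg_replace('/[^a-z0-9\s]/i', ' \0', $result);
foreach ($exceptions as $exception) {
    $result = str_replace(md5($exception), $exception, $result);
}
echo $result; // This punctuation ! And this one . Doesn't space that one .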

How can I check whether the input string partially matches any word in the given array in PHP?

For example my input string is
$edition = Vol.123 or Edition 1920 or Volume 951 or Release A20 or Volume204 or Edition967
How can I check the words in string matches any word in the array.
$editionFormats = ['Vol','Volume','Edition','Release'];
Basically I need to check whether the input has Vol or Volume or Edition or Release.
Can anyone please provide a way to check the pattern?
I tried strpos(), preg_grep(), preg_match(), split() and str_split(). My idea was to split the string after the first occurrence of a period, white space or digit,
but I wasn't able to work it out.
Solution with regexp:
$edition[] = 'Vol.123';
$edition[] = 'Edition 1920';
$edition[] = 'Volume 951';
$edition[] = 'Release A20';
$edition[] = 'Unknown data';
$editionFormats = ['Vol','Volume','Edition','Release'];
$pattern = implode('|', $editionFormats);
foreach ($edition as $e) {
if (preg_match('/' . $pattern. '/', $e)) {
echo $e . ' matches' . PHP_EOL;
} else {
echo $e . ' NOT matches' . PHP_EOL;
}
}
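A slightly stricter variant of the same idea, offered as a sketch: it escapes each keyword with preg_quote and anchors the match at the start of the string, on the assumption that the keyword always comes first (as in all the sample inputs). The function name has_edition_prefix is hypothetical.

```php
<?php
// Hypothetical helper: true when $edition starts with one of $formats.
function has_edition_prefix(string $edition, array $formats): bool
{
    // Escape every keyword so characters like "." are matched literally.
    $quoted = array_map(fn($f) => preg_quote($f, '/'), $formats);
    // "^" anchors the alternation at the beginning of the input.
    $pattern = '/^(?:' . implode('|', $quoted) . ')/';
    return preg_match($pattern, $edition) === 1;
}

$editionFormats = ['Vol', 'Volume', 'Edition', 'Release'];
foreach (['Vol.123', 'Volume204', 'Edition967', 'Unknown data'] as $e) {
    echo $e . (has_edition_prefix($e, $editionFormats) ? ' matches' : ' does not match') . PHP_EOL;
}
```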
Assuming your input is a single string (it wasn't obvious to me from the question)
A non regex way to do it is to look at the intersection between the set of words in the incoming string, and the set of words you're interested in:
$edition = 'Vol.123 or Edition 1920 or Volume 951 or Release A20';
$editionFormats = ['Vol','Volume','Edition','Release'];
// Break $edition into single words, splitting on the space character.
$edition_words = explode(" ", $edition);
$present = !empty(array_intersect($edition_words, $editionFormats));
If you meant that $edition would be just one of those, i.e.
$edition = 'Volume 951';
This approach will still work; Note that splitting on the space character only works if there is a space, so your 'Vol.123' wouldn't get matched, unless you also included 'Vol.' in your $editionFormats.
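One possible way around the 'Vol.123' limitation, sketched under the assumption that the keywords consist of ASCII letters only: split the input on every run of non-letters instead of on spaces, then intersect as before. Note that this swaps explode for preg_split, so it is no longer regex-free, and the function name mentions_edition_word is made up for the example.

```php
<?php
// Hypothetical helper: tokenises on anything that is not a letter, so
// "Vol.123" and "Volume204" yield the words "Vol" and "Volume".
function mentions_edition_word(string $edition, array $formats): bool
{
    $words = preg_split('/[^A-Za-z]+/', $edition, -1, PREG_SPLIT_NO_EMPTY);
    // A non-empty intersection means at least one keyword is present.
    return count(array_intersect($words, $formats)) > 0;
}

$editionFormats = ['Vol', 'Volume', 'Edition', 'Release'];
var_dump(mentions_edition_word('Vol.123', $editionFormats));
var_dump(mentions_edition_word('Unknown data', $editionFormats));
```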

php replace with regex and remove specified pattern with regex

|affffc100|Hitem:bb:101:1:1:1:1:48:-30:47:18:5:2:6:6:0:0:0:0:0:0:0:0|h[Subject Name]|h|r
my usual printed out variable is ^
|cffffc700|Hitem:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x|h[SUBJECT_NAME]|h|r
my pattern is ^
ALL X's can be a-Z, 0-9
in one column I have many variables like that (up to 8).
and all variables are mixed with strings like that:
|affffc100|Hitem:bb:101:1:1:1:1:48:-30:47:18:5:2:6:6:0:0:0:0:0:0:0:0|h[Gold]|h|r NEW SOLD |affffc451|Hitem:bb:101:1:1:1:1:25:-33:12:42:5a:2f:6w:6:0:0:0:0f:0:0a:0b:0|h[Copper]|h|r maximum price 15k|affffx312|Hitem:bb:101:1:1:1:1:25:-33:12:42:5a:2f:6w:6:0:0:0:0f:0:0a:0b:0|h[Silver]|h|r
In one variable I want to clean all these unnecessary patterns and leave only subject name in brackets. []
So;
|cffffc700|Hitem:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x:x|h[SUBJECT NAME]|h|r
needs to leave only SUBJECT_NAME in my variable.
Just to remind you, every variable always contains more than one of these patterns (up to 8).
I've searched everywhere but couldn't find any reasonable answers or good patterns. I tried to write it myself; I guess I need to collect all these patterns into an array, clean them and leave only the subject names, but I don't know exactly how to do it.
How do I convert this:
|affffc100|Hitem:bb:101:1:1:1:1:48:-30:47:18:5:2:6:6:0:0:0:0:0:0:0:0|h[Gold]|h|r NEW SOLD |affffc451|Hitem:bb:101:1:1:1:1:25:-33:12:42:5a:2f:6w:6:0:0:0:0f:0:0a:0b:0|h[Copper]|h|r maximum price 15k|affffx312|Hitem:bb:101:1:1:1:1:25:-33:12:42:5a:2f:6w:6:0:0:0:0f:0:0a:0b:0|h[Silver]|h|r
into this:
Gold NEW SOLD Copper maximum price 15k Silver
what should I use, preg_replace?
One more thing: when I have a string without my special pattern, I get an empty result from the function, e.g.:
$str = "15KKK sold, 20KK updated";
expected result:
"15KKK sold, 20KK updated" // same without any pattern
but ^ that one returns EMPTY result..
another string:
$str = "|affffc100|Hitem:bb:101:1:1:1:1:48:-30:47:18:5:2:6:6:0:0:0:0:0:0:0:0|h[Uranium]|h|r 155kk |affffc451|Hitem:bb:101:1:1:1:1:25:-33:12:42:5a:2f:6w:6:0:0:0:0f:0:0a:0b:0|h[Metal]|h|r is sold";
expected result:
"Uranium 155kk Metal is sold"
if I use that function with non-pattern string it returns empty result that's my problem now
thank you very much
I'd do:
$str = '|affffc100|Hitem:bb:101:1:1:1:1:48:-30:47:18:5:2:6:6:0:0:0:0:0:0:0:0|h[Gold]|h|r NEW SOLD |affffc451|Hitem:bb:101:1:1:1:1:25:-33:12:42:5a:2f:6w:6:0:0:0:0f:0:0a:0b:0|h[Copper]|h|r maximum price 15k|affffx312|Hitem:bb:101:1:1:1:1:25:-33:12:42:5a:2f:6w:6:0:0:0:0f:0:0a:0b:0|h[Silver]|h|r';
preg_match_all('/h(\[.+?\])\|h\|r([^|]*)/', $str, $m);
$res = ''; // initialise before appending, otherwise PHP raises an undefined variable warning
for($i=0; $i<count($m[0]); $i++) {
    $res .= $m[1][$i] . ' ' . $m[2][$i] . ' ';
}
echo $res,"\n";
Output:
[Gold] NEW SOLD [Copper] maximum price 15k [Silver]
If you want to keep the strings that don't match, test the result of preg_match_all:
$res = '';
if (preg_match_all('/h(\[.+?\])\|h\|r([^|]*)/', $str, $m)) {
    for($i=0; $i<count($m[0]); $i++) {
        $res .= $m[1][$i] . ' ' . $m[2][$i] . ' ';
    }
} else {
    // No item pattern found: keep the original string unchanged.
    $res = $str;
}
echo $res,"\n";
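Try this regex:
\|\w{9}\|Hitem(?::-?\w+)+\|h\[(?<SUBJECTNAME>\w+)\]\|h\|r
It will capture each variable sequence, with the subject name in the named group SUBJECTNAME. Note that \w+ does not match spaces, so a name such as "Subject Name" would need a wider class, for example [^\]]+.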
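Since the question asks about preg_replace, here is a minimal sketch of that route: replace each whole item sequence with just the bracketed name. Strings without any item sequence come back unchanged, which addresses the empty-result problem. The function name strip_item_links is invented, and the pattern assumes every sequence follows the |xxxxxxxxx|Hitem:...|h[Name]|h|r layout shown in the question.

```php
<?php
// Hypothetical helper: swaps every "|c...|Hitem:...|h[Name]|h|r" sequence
// for "Name" and leaves all other text alone.
function strip_item_links(string $text): string
{
    // Group 1 is the name between the square brackets.
    $pattern = '/\|\w+\|Hitem(?::-?\w+)+\|h\[([^\]]+)\]\|h\|r/';
    return preg_replace($pattern, '$1', $text);
}

$str = "|affffc100|Hitem:bb:101:1:1:1:1:48:-30:47:18:5:2:6:6:0:0:0:0:0:0:0:0|h[Uranium]|h|r 155kk "
     . "|affffc451|Hitem:bb:101:1:1:1:1:25:-33:12:42:5a:2f:6w:6:0:0:0:0f:0:0a:0b:0|h[Metal]|h|r is sold";
echo strip_item_links($str), "\n";
echo strip_item_links("15KKK sold, 20KK updated"), "\n";
```

Because only the matched sequences are replaced, spacing in the result mirrors the input: where the input has no space before an item, the output has none either.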

How to count the first 30 letters in a string ignoring spaces

I want to take a post description but only display the first, for example, 30 letters but ignore any tabs and spaces.
$msg = 'I only need the first, let us just say, 30 characters; for the time being.';
$msg .= ' Now I need to remove the spaces out of the checking.';
$amount = 30;
// if tabs or spaces exist, alter the amount
if(preg_match("/\s/", $msg)) {
$stripped_amount = strlen(str_replace(' ', '', $msg));
$amount = $amount + (strlen($msg) - $stripped_amount);
}
echo substr($msg, 0, $amount);
echo '<br /> <br />';
echo substr(str_replace(' ', '', $msg), 0, 30);
The first output gives me 'I only need the first, let us just say, 30 characters;' and the second output gives me: Ionlyneedthefirst,letusjustsay so I know this isn't working as expected.
My desired output in this case would be:
I only need the first, let us just say
Thanks in advance, my maths sucks.
You could get the part with the first 30 characters with a regular expression:
$msg_short = preg_replace('/^((\s*\S\s*){0,30}).*/s', '$1', $msg);
With the given $msg value, you will get in $msg_short:
I only need the first, let us just say
Explanation of the regular expression
^: the match must start at the beginning of the string
\s*\S\s*: a non-white-space character (\S) surrounded by zero or more white-space characters (\s*)
(\s*\S\s*){0,30}: repeat finding this sequence up to 30 times (greedy; get as many as possible within that limit)
((\s*\S\s*){0,30}): the parentheses make this series of characters group number 1, which can be referenced as $1
.*: any other characters. This will match all remaining characters, because of the s modifier at the end
s: makes the dot match new line characters as well
In the replacement only the characters are maintained that belong to group one ($1). All the rest is ignored and not included in the returned string.
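The same idea can be wrapped in a small function with a configurable limit. This is a sketch rather than the answer's own code: it uses preg_match instead of preg_replace, drops the trailing \s* so the result never ends in white space, and the name first_non_space_chars is made up. It works on bytes, so multibyte text would presumably need the u modifier.

```php
<?php
// Hypothetical helper: returns the start of $text up to and including
// the $limit-th non-white-space character.
function first_non_space_chars(string $text, int $limit): string
{
    // Each repetition consumes optional white space plus one visible character.
    $pattern = '/^(?:\s*\S){0,' . max(0, $limit) . '}/';
    preg_match($pattern, $text, $m);
    return $m[0];
}

$msg  = 'I only need the first, let us just say, 30 characters; for the time being.';
$msg .= ' Now I need to remove the spaces out of the checking.';
echo first_non_space_chars($msg, 30), "\n";
```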
Off the top of my head, I can think of two ways to achieve that.
The first one is close to what you did already. Take the first 30 characters, count the spaces and take as many next characters as you found spaces until the new set of letters has no spaces in it anymore.
$msg = 'I only need the first, let us just say, 30 characters; for the time being.';
$msg .= ' Now I need to remove the spaces out of the checking.';
$amount = 30;
$offset = 0;
$final_string = '';
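while ($amount > 0) {
    $tmp_string = substr($msg, $offset, $amount);
    if ($tmp_string === '' || $tmp_string === false) {
        break; // reached the end of $msg; without this check the loop would never end
    }
    $amount -= strlen(str_replace(' ', '', $tmp_string));
    $offset += strlen($tmp_string);
    $final_string .= $tmp_string;
}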
print $final_string;
The second technique would be to explode your string at spaces and put them back together one by one until you hit your threshold (where you would eventually need to break down a single word into characters).
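A minimal sketch of that second technique, assuming only the space character is to be ignored (tabs would count as ordinary characters here). The function name take_non_space_chars is hypothetical.

```php
<?php
// Hypothetical helper: keeps words until $limit non-space characters are
// collected, cutting the last word if it would overshoot the limit.
function take_non_space_chars(string $text, int $limit): string
{
    $kept = [];
    foreach (explode(' ', $text) as $word) {
        if ($limit <= 0) {
            break;
        }
        // A word longer than the remaining budget is cut down to fit.
        $piece = substr($word, 0, $limit);
        $limit -= strlen($piece);
        $kept[] = $piece;
    }
    return implode(' ', $kept);
}

$msg = 'I only need the first, let us just say, 30 characters; for the time being.';
echo take_non_space_chars($msg, 30), "\n";
```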
Try this out if it works:
<?php
$string= 'I only need the first, let us just say, 30 characters; for the time being.';
echo "Everything: ".strlen($string);
echo '<br />';
echo "Only alphabetical: ".strlen(preg_replace('/[^a-zA-Z]/', '', $string));
?>
It can be done this way.
$tmp=str_split($string);//split the string
$result="";
$i=0;$j=0;
while(isset($tmp[$i]) && $j<30){
if(trim($tmp[$i]) !== ''){//test for non space and count; a bare if(trim(...)) would not count the character "0"
$j++;
}
$result .= $tmp[$i++];
}
print $result;
I don't know regex too well so...
<?php
$msg = 'I only need the first, let us just say, 30 characters; for the time being. Now I need to remove the spaces out of the checking.';
$non_space_hit = 0;
for($i = 0; $i < strlen($msg); ++$i)
{
echo $msg[$i];
$non_space_hit+= (int)($msg[$i] !== ' ' && $msg[$i] !== "\t");
if($non_space_hit === 30)
{
break;
}
}
You end up with:
I only need the first, let us just say

Extract substring from a certain indexposition of (huge) string

Let's say I have a huge string from which I want to extract a certain value belonging to a name, for example the stock price of Apple.
Let's say the string looks like this (in reality it's HTML, but that does not matter here):
$output = "nsdfsdnfsnfdnsdfnueruherdfndsdndnjsdnasdnn Apple dndfjnfjdf647tgtgtgeq";
I want to extract the value 647.
The real string is maybe some hundred thousand characters.
I can reveal the position of Apple by:
$str = "Apple";
$pos = strpos($output, $str);
Let's say the function returns 87310, which is the index position of the first letter in Apple.
Here comes my question: is there an easy way to extract the value when I know the start position of Apple? I have looked for such a function but cannot find one right now.
I could solve this easily by looping ahead from the name Apple and extracting the relevant characters, but it would at least save keystrokes to use a function for this instead.
Thanks!!!
To just pull out the stock price, you would want to do something like this:
Search your string for "Apple" and save $position + 5 (length of Apple). Search directly after $position, one character at a time, for the first character that is_numeric and add that to a string, $stock_val. Continue adding all subsequent characters until you find one that !is_numeric. Here is my clunky code:
$position = strpos(strtolower($str), "apple") + strlen("apple");
$temp_str = substr($str, $position);
$stock_val = "";
do {
$char = substr($temp_str, 0, 1); //Take first char of $temp_str
$temp_str = substr($temp_str, 1); //Remove that char from $temp_str
$is_acceptable = (is_numeric($char) || $char == "." || $char == ",");
if($is_acceptable) { //If the char is_numeric, add it to $stock_val
$stock_val .= $char;
}
if(!$is_acceptable && $stock_val != "") {
break; //If the char is NOT numeric AND $stock_val
} //already has characters, break.
} while(strlen($temp_str) > 0); //Repeat while there are still characters
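For comparison, the same search can be sketched with preg_match and its offset parameter, so the regex engine does the character-by-character scanning. This is an alternative to the loop above, not the answerer's code; the function name number_after is made up, and it assumes the price is the first run of digits (optionally with "." or "," separators) after the name.

```php
<?php
// Hypothetical helper: returns the first number that appears after $name
// in $haystack, or null when the name or a number is not found.
function number_after(string $haystack, string $name): ?string
{
    $pos = stripos($haystack, $name);   // case-insensitive, like strtolower + strpos
    if ($pos === false) {
        return null;
    }
    // The fifth argument makes preg_match start searching after the name.
    $offset = $pos + strlen($name);
    if (preg_match('/\d+(?:[.,]\d+)*/', $haystack, $m, 0, $offset)) {
        return $m[0];
    }
    return null;
}

$output = "nsdfsdnfsnfdnsdfnueruherdfndsdndnjsdnasdnn Apple dndfjnfjdf647tgtgtgeq";
var_dump(number_after($output, "Apple"));
```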
You know the start position, so work out the end position (strlen($string) if you want everything up to the end of the string), then use substr to cut away the unwanted parts.
Something like this:
$portion = substr($string, $start, $end - $start);
