PHP sentence search

Is it possible to search a sentence by inputting first and last words?
Example:
We have a sentence "Hello world this is an example and here is FIRST word, or the start of a sentence that I need to take from here, and here is the LAST word, what means that everything before this word is not needed".
What I want to get:
"FIRST word, or the start of a sentence that I need to take from here, and here is the LAST".
Something like this, and I need it in PHP.
My idea is to do this with an array that starts saving words at the first word and stops at the last word.

The question lacks any effort, but it is kind of interesting. This is one boring, old-fashioned way:
<?php
$string="Hello world this is an example and here is FIRST word,
or the start of a sentence that I need to take from here,
and here is the LAST word, what means that everything
before this word is not needed";
$first="FIRST";
$last="LAST";
$string=substr($string,strpos($string,$first)); // Cut Left
$string=substr($string,0,strpos($string,$last)+strlen($last)); // Cut Right
echo $string;
Output
FIRST word, or the start of a sentence that I need to take from here,
and here is the LAST
But since you didn't put in any effort, don't expect an explanation or a better way with regexes :)
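For comparison, a regex variant of the same cut might look like the sketch below. The function name between() is hypothetical, and the sketch assumes you want the first occurrence of the first word and the nearest occurrence of the last word after it.

```php
<?php
// Hypothetical helper: return the text from $first through the nearest
// following $last, or null when no such span exists.
function between(string $text, string $first, string $last): ?string
{
    // preg_quote() escapes regex metacharacters in the two words.
    // .*? is lazy, so the match stops at the first $last after $first.
    // The s modifier lets . match newlines, for multi-line sentences.
    $pattern = '/' . preg_quote($first, '/') . '.*?' . preg_quote($last, '/') . '/s';

    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

echo between("here is FIRST word, and here is the LAST word, rest", "FIRST", "LAST");
```

Unlike the substr() version, this returns null when either word is missing, instead of silently returning an unexpected slice.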

Related

String Compression Class

I am trying to build a string compression system that can compress strings containing frequently used words.
But I have no idea how to make the logic work.
I was thinking of replacing words that appear often with a simple <1> and putting each word in an array, so that when reading the string back we can see that <1> stands for the first word in the array, or something like that.
But that is not my problem at the moment.
I'm trying to figure out how to count how many times a word appears.
I can't really use explode(' ', $str) and count the occurrences, since I would like to check not only words but everything; for example, if there is always a space between two words, I would like to store that in my array as well.
All of this is with the idea of compressing a string.
I am not looking for code, though; I am simply trying to find good logic to make this work.
Does anyone have an idea of how I could achieve that?
Thanks for any comment/answer.

I think the only way to do this is a sliding window... Hopefully you are using small strings :)
So, let's say your string was:
"Joey Novak Needs More Reputation :)"
We start with a 10-character window and search the string for other instances of that substring. The first 10-character substring is "Joey Novak", so we search the remainder of the string for it. If we find one, awesome! We replace it with the marker (<1> works) and search again. If we don't, we move on to the next substring, which is "oey Novak ", and do the same, and so on. When we finish with all the 10-character substrings, we move on to 9 characters and work our way down. Since the marker is 3 characters long, you only need to go down to 4-character substrings.
Joey
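The counting half of that sliding window could be sketched roughly as follows. The function name repeatedSubstrings() is hypothetical, and the sketch assumes single-byte text. It counts overlapping occurrences too, which a real compressor would need to resolve before substituting markers.

```php
<?php
// Hypothetical helper: slide a window of $len characters over $text and
// return every substring that occurs more than once, most frequent first.
function repeatedSubstrings(string $text, int $len): array
{
    $counts = [];
    $lastStart = strlen($text) - $len;

    for ($i = 0; $i <= $lastStart; $i++) {
        $window = substr($text, $i, $len);
        $counts[$window] = ($counts[$window] ?? 0) + 1;
    }

    // Keep only substrings seen at least twice; those are marker candidates.
    $repeated = array_filter($counts, fn (int $n): bool => $n > 1);
    arsort($repeated);

    return $repeated;
}

// Work from long windows down to 4 characters, as described above.
for ($len = 10; $len >= 4; $len--) {
    print_r(repeatedSubstrings("the cat and the dog", $len));
}
```

Using an associative array for the counts avoids rescanning the remainder of the string for every window position.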

Displaying words after taking data (user chosen word length) from a txt file PHP

I am a bit new to PHP and have a question that I would like to get some different ideas on. I am writing a PHP script that will open as an HTML form with an intro section, a few radio buttons, and a submit button. There is a static text file that I have chosen (an ebook), and when the user clicks submit, I want the PHP script to remove words from the file depending on the length the user chooses (the radio buttons are labeled 1, 2, 3, etc.).
This is what I have been playing with:
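<?php
$myTextFile = "ebook.txt";
// Read the whole file into a string (the function is file_get_contents).
$fileContents = file_get_contents($myTextFile);
// First call: the string to split comes first, the delimiter characters second.
$txtTok = strtok($fileContents, "allremoveddelimiters");
while ($txtTok !== false) {
    echo $txtTok;
    // Later calls take only the delimiters and continue in the same string.
    $txtTok = strtok("allremoveddelimiters");
}
?>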
I have gotten my string stripped of all the chosen delimiters, and it echoes out all on one line. This is where I have become stuck. What I need to do now is pull words of a certain length (chosen by the user through the radio buttons) out of the text file and then print the changed text to the screen. I have looked at the explode() function, but I am not sure that I really understand it. I also played with str_replace(), but it didn't seem to do what I needed, or I just didn't understand the function completely. Also, I was advised not to use the preg_() functions. I assume I would want to put all the words from the text file into an array, pull out the words of the user-chosen length, and then print the updated array back as a string, but so far I haven't found an example that was understandable with my novice skills.
A push in any direction will be certainly appreciated. If any other information is needed, please advise.
Example:
The full string would contain this -
YOU don't know about me without you have read a book by the name of The
Adventures of Tom Sawyer; but that ain't no matter. That book was made
by Mr. Mark Twain, and he told the truth, mainly. There was things which
he stretched, but mainly he told the truth. That is nothing.
And if the user chooses 3 from the choices -
don't know about me without have read a book by name of Adventures of Sawyer; that ain't no matter. That book made by Mr. Mark Twain, he told truth, mainly. There things which he stretched, mainly he told truth. That is nothing.
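The array idea described in the question can be sketched without any preg_ function, roughly like this. The function name removeWordsOfLength() and the punctuation list are assumptions for illustration. Word length is measured after trimming surrounding punctuation, so "Sawyer;" counts as 6 characters.

```php
<?php
// Hypothetical helper: drop every word whose length, ignoring surrounding
// punctuation, equals $length. Whitespace runs collapse to single spaces.
function removeWordsOfLength(string $text, int $length): string
{
    $kept = [];
    $delimiters = " \n\r\t";

    // strtok() skips empty tokens, so repeated spaces and newlines are fine.
    $word = strtok($text, $delimiters);
    while ($word !== false) {
        // Assumed punctuation set; apostrophes inside a word ("don't") stay.
        $bare = trim($word, ".,;:!?\"'()");
        if (strlen($bare) !== $length) {
            $kept[] = $word;
        }
        $word = strtok($delimiters);
    }

    return implode(' ', $kept);
}

echo removeWordsOfLength("YOU don't know about me without you have read a book", 3);
```

In the real script, $length would come from the submitted radio button value, cast to an integer.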

Thinking about this quickly, I would recommend looking at regular expressions. This should easily allow you to extract words based on length. They can be quite complicated to understand though.
I would push you in the direction of preg_replace, http://php.net/manual/en/function.preg-replace.php. This will allow you (once you have your working regular expression) to extract words of the specified length, and replace them with an empty string.
I would recommend looking at a site like http://public.kvalley.com/regex/regex.asp for testing created regular expressions.
Having taken a quick look, I believe the following expression matches 4 letter words:
\b(\w{4})\b
Replace the 4 in the pattern with the length of the word you want to match. To break down the regex simply: \b stands for a word boundary (e.g. the position between a letter and a space or full stop), \w stands for a word character (a letter, digit, or underscore), and {4} means match the previous item 4 times (so a 4-letter word). I believe the following will work.
$myReplacedString = preg_replace('/\b(\w{4})\b/', '', $myString);
$myTrimmedString = trim(preg_replace('/\s\s+/', ' ', $myReplacedString));
where $myString is your string you've fetched from your file.
I hope this explains it well.
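One caveat with \b is that an apostrophe counts as a boundary, so a 3-letter pattern would also match the "don" in "don't". A sketch with the length as a parameter and lookarounds that treat the apostrophe as part of a word might look like this. The function name stripWordsByRegex() and the apostrophe handling are assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: remove every run of exactly $length word characters
// that is not attached to another word character or an apostrophe.
function stripWordsByRegex(string $text, int $length): string
{
    // (?<![\w']) and (?![\w']) replace \b so "don't" is treated as one word.
    $pattern = sprintf('/(?<![\w\'])\w{%d}(?![\w\'])/', $length);
    $stripped = preg_replace($pattern, '', $text);

    // Collapse the double spaces left behind, then trim the ends.
    return trim(preg_replace('/\s\s+/', ' ', $stripped));
}

echo stripWordsByRegex("YOU don't know the road", 3);
```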

Best way to parse a text document

I'm trying to parse a plain text document in PHP but have no idea how to do it correctly.
I want to separate each word, assign them an ID and save the result in JSON format.
Sample text:
"Hello, how are you (today)"
This is what I'm doing at the moment:
$document_array = explode(' ', $document_text);
json_encode($document_array);
The resulting JSON is
[["Hello,"],["how"],["are"],["you"],["(today)"]]
How do I ensure that spaces are kept in-place and that symbols are not included along with the words...
[["Hello"],[", "],["how"],[" "],["are"],[" "],["you"],[" ("],["today"],[")"]]
I’m sure some sort of regex is required... but have no idea what kind of pattern to apply to deal with all cases... Any suggestions guys?

This is actually a really complex problem, and one that's subject to a fair amount of academic research. It sounds so simple (just split on whitespace, with maybe a few rules for punctuation...) but you quickly run into issues. Is "didn't" one word or two? What about hyphenated words? Some might be one word, some might be two. What about multiple successive punctuation characters? Possessives versus quotes? And so on. Even determining the end of a sentence is non-trivial. (It's just a full stop, right?!)
This problem is one of tokenisation and a topic that search engines take very seriously. To be honest you should really look at finding a tokeniser in your language of choice.

Maybe this?
array_filter(preg_split('/\b/', $document_text))
The array_filter call removes the empty values at the first and/or last index of the resulting array, which appear if your string starts or ends with a word boundary (for \b, see http://php.net/manual/en/regexp.reference.escape.php). Be aware that array_filter without a callback also drops any token that is the string "0".
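A variant of the same split that lets preg_split() drop the empty pieces itself, via its PREG_SPLIT_NO_EMPTY flag, could look like the sketch below. The function name tokenize() is hypothetical. Each run of punctuation and spaces comes out as a single token, so this only approximates the output asked for in the question.

```php
<?php
// Hypothetical helper: split text at every word boundary, keeping the
// separators (spaces, punctuation) as their own tokens.
function tokenize(string $text): array
{
    // -1 means "no limit"; PREG_SPLIT_NO_EMPTY discards empty pieces,
    // so array_filter() is not needed and a "0" token survives.
    return preg_split('/\b/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

echo json_encode(tokenize('Hello, how are you (today)'));
```

The array indexes can serve as the word IDs, since the tokens stay in document order.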

Regex to check if exact string exists

I am looking for a way to check if an exact string match exists in another string using Regex or any better method suggested. I understand that you tell regex to match a space or any other non-word character at the beginning or end of a string. However, I don't know exactly how to set it up.
Search String: t
String 1: Hello World, Nice to see you! t
String 2: Hello World, Nice to see you!
String 3: T Hello World, Nice to see you!
I would like to use the search string and compare it to String 1, String 2 and String 3 and only get a positive match from String 1 and String 3 but not from String 2.
Requirements:
Search String may be at any character position in the Subject.
There may or may not be a white-space character before or after it.
I do not want it to match if it is part of another string; such as part of a word.
For the sake of this question:
I think I would do this using this pattern: /\bt\b/gi
/\b{$search_string}\b/gi
Does this look right? Can it be made better? Any situations where this pattern wouldn't work?
Additional info: this will be used in PHP 5

Your suggestion of /\bt\b/gi is probably the way to go. You've correctly used \b for word boundaries, and the case-insensitive modifier will match both t and T. One caveat for PHP: its preg functions have no g (global) modifier, so drop the g and use preg_match_all() if you need every match. Simple, straightforward, clean. Look no further than what you've already come up with.

Looks fine to me. You might want to check the exact meaning of the \b assertion to make sure it's exactly what you need.
Can't really name any situation where this pattern "wouldn't work" without a more elaborate description, but \b would work fine for your testcases.
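In PHP, the pattern from the question could be wrapped roughly as follows. The function name containsWholeWord() is hypothetical. The sketch uses only the i modifier, because PHP's PCRE functions reject g, and passes the search string through preg_quote() so that regex metacharacters in it are matched literally.

```php
<?php
// Hypothetical helper: true when $needle occurs in $subject as a whole
// word (delimited by \b on both sides), ignoring case.
function containsWholeWord(string $subject, string $needle): bool
{
    $pattern = '/\b' . preg_quote($needle, '/') . '\b/i';

    return preg_match($pattern, $subject) === 1;
}

var_dump(containsWholeWord("Hello World, Nice to see you! t", "t")); // String 1
var_dump(containsWholeWord("Hello World, Nice to see you!", "t"));   // String 2
var_dump(containsWholeWord("T Hello World, Nice to see you!", "t")); // String 3
```

Note that \b only helps when the search string itself starts and ends with a word character. A search string such as "!" would need a different boundary test.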

As the old saying goes: give a man a regular expression and he is happy for a day; teach him to write regular expressions and he is happy for a lifetime (or something to that effect). Try out the "regulator".
It provides a GUI and some pretty good examples for regular expression needs.

Find beginning of sentence in String

I want to display the results of a search query on a website with a title and a short description. The short description should be a small part of the page that contains the search term. What I want to do is:
1 strip the tags from the page
2 find the first position of the search term
3 from that position, going backwards, find the beginning (if there is one) of that sentence
4 start at the position found in step 3 and display e.g. 200 characters from there
I need some help with step 3. I think I need a regex that finds the first capital or dot...

Even that will ultimately fail. Given the sentence "We went to Dr. Smith's office", if your search term is "office", virtually any criterion you use will give you "Smith's office" as your sentence.

The way I would do it is to parse the page...
Skip over everything that starts with '<'.
When you encounter a "." or [A-Z], start putting characters into a buffer until you find another ".".
If the buffered string contains the search keyword, that's your string! Otherwise, start buffering again at the "." you encountered and repeat.
EDIT: As James Curran pointed out, this strategy would fail in some cases... So here's the solution:
What you can do is start X characters from the start of the page (after tags)
and then search for your keyword, buffering the 2 previous words. When you find it,
do something like this: {X} ... {prev-2} {next-2}
Example: This planet has - or rather had - a problem, which was this: most of the people living on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movement of small green pieces of paper, which was odd because on the whole it wasn't the small green pieces of paper that were unhappy.
Search Keyword: "suggested"
Result: This planet has - or rather had - a problem ... Many solutions were suggested for this problem...

For step 3: if you reverse the substring that ends where you want to search backward from, get the position of the first '.' in it, and subtract that value from the position of your search string, you get the start of the sentence.
// Reverse the text *before* the search term, then find the nearest '.'.
$offset = stripos(strrev(substr($string, 0, $searchlocation)), '.');
if ($offset === false) { $offset = $searchlocation; } // no '.': start at 0
$startloc = $searchlocation - $offset;
$finalstring = substr($string, $startloc, 200);
That may be off by 1, but I think it'll get the job done. Seems like there should be a shorter way to do it.
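One shorter way, sketched here with a hypothetical function name, is to let strrpos() find the last '.' before the search term directly, with no reversing. As noted elsewhere in this thread, any full-stop rule misfires on abbreviations such as "Dr.".

```php
<?php
// Hypothetical helper: return up to $length characters starting at the
// sentence that contains the first occurrence of $term, or null if absent.
function snippetFromSentenceStart(string $text, string $term, int $length = 200): ?string
{
    $pos = stripos($text, $term);
    if ($pos === false) {
        return null; // search term not present
    }

    // Last full stop before the term; false means the term is in the
    // first sentence, so the snippet starts at the beginning of the text.
    $dot = strrpos(substr($text, 0, $pos), '.');
    $start = ($dot === false) ? 0 : $dot + 1;

    // ltrim() drops the space that follows the full stop.
    return ltrim(substr($text, $start, $length));
}

echo snippetFromSentenceStart("First one. Second has the term here. Third.", "term");
```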

Instead of trying to find sentences, I'd think about how much context around the search term I need, measured in words. Then go backwards some fraction of that number of words (or to the beginning) and forward the remaining number of words to select the rest of the context. That way, you just split the entire corpus on whitespace, find the first occurrence of the term (perhaps using a fuzzy match to find subterms and account for punctuation), and apply the above algorithm. You could even be creative about introducing ellipses if the first non-selected term doesn't end in punctuation, etc.
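A minimal sketch of that word-window idea follows. The function name contextWindow() and the default window sizes are assumptions. The "fuzzy match" here is only a case-insensitive substring test on each word, and $term is assumed to be non-empty.

```php
<?php
// Hypothetical helper: return the first word containing $term together
// with up to $before words of leading and $after words of trailing context.
function contextWindow(string $text, string $term, int $before = 3, int $after = 5): ?string
{
    $words = preg_split('/\s+/', trim($text));

    foreach ($words as $i => $word) {
        // Substring test, so "problem," still matches the term "problem".
        if (stripos($word, $term) !== false) {
            $start = max(0, $i - $before);          // clamp at the beginning
            $count = ($i - $start) + 1 + $after;    // array_slice clamps the end
            return implode(' ', array_slice($words, $start, $count));
        }
    }

    return null; // term not found
}

echo contextWindow("one two three four five six seven eight nine ten", "six", 2, 2);
```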

To save others from thinking they can beat this problem: it can't be done without accepting either false positives or false negatives. To add to what James Curran said, you either declare Smith the start of a sentence in "We went to Dr. Smith's office.", or you read "This sentence is English. So is this one." as a single sentence.
Besides those problems, different forms of abbreviations and Overeager Capitalization Of Every Word Can Kill Your Algorithm Or Regex.
That said, I might as well share the regexes I came up with.
The first regex is simple enough:
(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)
It matches the start of a line or one of the characters .!? followed by at least one tab or space, after which a capital letter is matched along with the rest of the word (including dots, to match abbreviations).
The first word of the sentence will be caught in group 1.
The second regex:
(?m)[A-Z]\S*\.[^\S\r\n]+[A-Z]|(?:^|[.!?][\t ]+)([A-Z]\S*)
This is the previous regex, prepended with [A-Z]\S*\.[^\S\r\n]+[A-Z]|. This part matches a word starting with a capital, followed by a dot, some whitespace and another capitalized character. Because the first part gets matched, the second part no longer tries to match it (explained in-depth here). The first word of the sentence will again be caught in group 1.
The first regex has false positives: it will wrongly match Smith in the second half of the sentence We went to Dr. Smith's office.
The second regex has false negatives: it will fail to match So in This sentence is English. So is this one.
Test the regexes here.
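To try both regexes from PHP, a small harness along these lines may help. The function name sentenceStarts() and the variable names are hypothetical; the two patterns are the ones given above, wrapped in / delimiters.

```php
<?php
// Hypothetical helper: run $pattern over $text and return the non-empty
// captures of group 1, i.e. the words taken to start a sentence.
function sentenceStarts(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $m);

    // When the "Dr. S" style alternative matches, group 1 is empty; skip it.
    return array_values(array_filter($m[1], fn ($s): bool => (string) $s !== ''));
}

$simple  = '/(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)/';
$guarded = '/(?m)[A-Z]\S*\.[^\S\r\n]+[A-Z]|(?:^|[.!?][\t ]+)([A-Z]\S*)/';

print_r(sentenceStarts($simple, "We went to Dr. Smith's office."));
print_r(sentenceStarts($guarded, "This sentence is English. So is this one."));
```

The first call shows the false positive (Smith's is reported as a sentence start) and the second shows the false negative (So is missed).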
