Is there a predefined library or function in PHP or Python that would let me extract 2-3 paragraphs from a complete document, based on the proximity of the keywords found in the document?
Let's say I have 5 keywords: A, B, C, D, E. I have an essay containing multiple occurrences of all of these keywords.
I would like to extract a few paragraphs from it which contain the closest occurrences of the keywords.
Probably split/explode on newlines and then enumerate through the ones you want.
text = """Test paragraph 1.
Test paragraph 2.
Test paragraph 3.
Test paragraph 4."""
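# The paragraphs above are separated by single newlines, so split on "\n".
paragraphs = text.split("\n")
print(paragraphs[1:3])  # the second and third paragraphs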
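To go from "split into paragraphs" to "pick the paragraphs where the keywords cluster", one rough option is to score each paragraph by how many distinct keywords it contains and keep the top few. This is only a sketch under that assumption; the function name `best_paragraphs` and the scoring rule are illustrative, not something from a library.

```python
import re

def best_paragraphs(paragraphs, keywords, n=2):
    """Return the n paragraphs containing the most distinct keywords."""
    wanted = {k.lower() for k in keywords}

    def score(paragraph):
        # Count how many of the wanted keywords occur as whole words.
        words = set(re.findall(r"\w+", paragraph.lower()))
        return len(wanted & words)

    # sorted() is stable, so paragraphs with equal scores keep document order.
    return sorted(paragraphs, key=score, reverse=True)[:n]

paragraphs = ["alpha beta", "gamma", "alpha beta gamma delta"]
print(best_paragraphs(paragraphs, ["alpha", "beta", "gamma"], n=2))
```

A paragraph that contains all keywords ranks first; counting distinct keywords instead of total hits keeps one heavily repeated keyword from dominating.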
Related
We are designing a translation project from Sindhi to English. In the Sindhi (Pakistani/Indus) language there are many expressions made of two words separated by a space that carry a single meaning, like "to eat" in English: two words, one meaning. I want to write a program that reads the first two words and looks the pair up in a database. If a meaning is found, it outputs it and reads the next pair. If no meaning is found, it looks up the first word alone and then continues with the two words that follow it. For example, I want to do this.
This is a simple sentence:
I want to eat a mango.
I want to use PHP or Visual Basic .NET to break it into this style:
I want
I
want to
want
to eat
to
eat mango
eat
mango
With this example, all words are read in both single and double style.
I have some hints:
use a loop: for (i = 0; i <= length of text; i++)
detect word separation where a space or punctuation mark is used
the code may be
str = substr(text, i, 1)
if str == " " or str is a punctuation mark (a space or punctuation mark is the word separator)
but remember that we first have to read the first two words, so keep reading until two separators have been seen
echo or print that double word.
Word reading may be like this:
the word length (wrdlen) is counted with the for loop's i variable and becomes 0 again after use, once a word has been assembled from its characters
tillword = substr(text, i - wrdlen, wrdlen)
These are some hints. I am stuck, so please help, anyone. With the help of the hints above, I need results in this form:
I want to eat a mango.
I want
I
want to
want
to eat
to
eat mango
eat
mango
You may recognize this double-word idea from any second language you know: a pair of words may carry a single meaning, and sometimes a single word is meaningless on its own, like "to" in English.
I am not sure I understand correctly, but from what I gather you want to translate on the basis of multi-word phrases instead of single words. This is kind of similar to what language parsers do while compiling or interpreting.
One simple way to implement this functionality would be to first break down the sentence into words. In Python this can be done very simply with something like:
words = sentence.split(' ')
Now you can try parsing these words by looping through them and storing them in a queue. The trick is to remember what was entered and have defined rules.
Let me give you an example. Let's say your sentence is "to eat a mango"
The rules of your language translation are (assumed):
to eat - X
to drink - XZ
mango - Y
So you loop through the words and enter them into a queue. After performing this step your queue will have
mango
a
eat
to
You can then start popping out elements. The first element to pop out is 'to'. Now check whether there are phrases that start with 'to'. If so, store it in a string and go to the next element, which is 'eat'. Concatenate this with the original string, separated by a space. So you get "to eat", which matches the rule "to eat" -> X, so now translate and return X.
Alternatively, if it hadn't matched, you translate the original string "to", return it, create a new string with the new element, and continue.
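The queue walk described above can be sketched roughly as follows. The `rules` dictionary stands in for the database lookup, and passing unknown words through unchanged is an assumption of this sketch, not something the approach requires.

```python
from collections import deque

def translate(sentence, rules):
    """Translate two-word phrases first, then fall back to single words."""
    queue = deque(sentence.split())
    output = []
    while queue:
        first = queue.popleft()
        # Try the two-word phrase formed with the next word in the queue.
        if queue and f"{first} {queue[0]}" in rules:
            second = queue.popleft()
            output.append(rules[f"{first} {second}"])
        else:
            # No phrase matched: translate the single word, or keep it as-is.
            output.append(rules.get(first, first))
    return output

rules = {"to eat": "X", "to drink": "XZ", "mango": "Y"}  # the assumed rules from above
print(translate("to eat a mango", rules))
```

Punctuation is not handled here: a token such as "mango." would not match the rule for "mango", so real input would need the separators stripped first.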
Hope this helps.
I want to highlight a group of words. They can appear singly or in a row. I'd like them to be highlighted together if they appear one after the other, and if they don't, they should still be highlighted individually, like the normal behavior. For instance, if I want to highlight the words:
results as
And the subject is:
real time results: shows results as you type
I'd like the result to be:
real time results: shows <span class="highlighted"> results as </span> you type
The whitespaces are also a headache, because I tried using an or expression:
( results )|( as )
with whitespaces to prevent highlighting words like bass, crash, and so on. But since the whitespace after results is the same as the whitespace before as, the regexp ignores it and only highlights results.
It can be used to highlight many words, so combinations of
( (one) (two) )|( (two) (one) )|( one )|( two )
are not an option :(
Then I thought that there may be an operator that works like | but could be used to match both if possible, else one or the other.
Using spaces to ensure you match full words is the wrong approach. That's what word boundaries are for: \b matches a position between a word and a non-word character (where word characters usually are letters, digits and underscores). To match combinations of your desired words, you can simply put them all in an alternation (like you already do), and repeat as often as possible. Like so:
(?:\bresults\b\s*|\bas\b\s*)+
This assumes that you want to highlight the first and separate results in your example as well (which would satisfy your description of the problem).
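As a rough illustration in Python, the same idea can be applied with `re.sub`. This sketch uses a slight variation of the pattern above (whitespace is only consumed between two listed words, so trailing spaces stay outside the span); the function name `highlight` is made up for the example.

```python
import re

def highlight(text, words):
    """Wrap each run of listed words in a single highlighted span."""
    word = "(?:" + "|".join(re.escape(w) for w in words) + ")"
    # One listed word, optionally followed by more listed words separated by whitespace.
    pattern = re.compile(rf"\b{word}\b(?:\s+{word}\b)*")
    return pattern.sub(
        lambda m: f'<span class="highlighted">{m.group(0)}</span>', text
    )

print(highlight("real time results: shows results as you type", ["results", "as"]))
```

Because of the word boundaries, words such as "bass" or "crash" are left alone even though they contain "as".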
Perhaps you do not need to match a string of words next to each other. Why not just apply your highlighting like so:
real time results: shows <span class="highlighted">results</span> <span class="highlighted">as</span> you type
The only real difference is that the space between the words is not highlighted, but it's a clean and easy compromise that will save you hours of work and doesn't seem to hurt the UX in the least (in my opinion).
In that case, you could just use alternation:
\b(results|as)\b
(\b being the word boundary anchor)
If you really don't like the space between words not being highlighted, you could write a jQuery function to find "highlighted" spans separated only by whitespace and then combine them (a "second stage" to achieve your UX design goals).
Update
(OK... so merging spans is actually kind of difficult via jQuery. See Find text between two tags/nodes)
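If the markup is generated as a string before it reaches the browser, the "second stage" can be done on the text instead of the DOM. The following is only a sketch of that idea in Python rather than jQuery; `highlight_two_stage` is a hypothetical name, and it assumes the input text contains no markup of its own.

```python
import re

OPEN = '<span class="highlighted">'
CLOSE = "</span>"

def highlight_two_stage(text, words):
    alternation = "|".join(re.escape(w) for w in words)
    # Stage 1: wrap every listed word in its own span.
    marked = re.sub(rf"\b({alternation})\b", rf"{OPEN}\1{CLOSE}", text)
    # Stage 2: merge spans that are separated by whitespace only.
    return re.sub(rf"{re.escape(CLOSE)}(\s+){re.escape(OPEN)}", r"\1", marked)

print(highlight_two_stage("real time results: shows results as you type", ["results", "as"]))
```

Stage 2 simply deletes a closing tag and the next opening tag when nothing but whitespace sits between them, which fuses the neighbouring spans.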
I have a content description and a few listed words ("Google" and "Gmail"). If these words appear in the content description, I have to replace them with their links. I created a regular expression and replaced them successfully using preg_match. But now I want to limit the replacements. For example:
If two matched words are very close to each other, the second one should not be replaced.
My description is as follows:
"This is my description for Google and Gmail. I need to replace Google with its link and also Gmail"
My requirement is that the first Gmail should not be replaced, because the first "Google" is very near to it (only 1 word away), and the remaining words should be replaced because they are far from each other. So my result should be:
This is my description for Google and Gmail. I need to replace Google with its link and also Gmail.
I have used lookahead matching but it is not working.
OK, I got the solution.
I used preg_match_all for each word one by one, then maintained an array of matched words with their offsets (PREG_OFFSET_CAPTURE).
I then kept a list of all matched words with their positions and sorted that list by each word's weight. From there, any algorithm can be used to track the nearest replacement in the text. I did the following:
1: Replace the first word in the list in the body and keep a temporary tracking array with the position of this word.
2: For the second word in the list, first check the temporary tracking array and find the nearest position to the second word. You can then count the words between the first word and the second word using the str_word_count function.
3: Repeat this for all words in the list.
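A simplified version of this idea, sketched in Python rather than PHP, might look like the following. It walks the text left to right instead of sorting by weight, and it skips a keyword when fewer than `min_gap` words separate it from the previously linked keyword. The URLs, the `min_gap` default, and the function name are assumptions made for the example.

```python
import re

def link_keywords(text, links, min_gap=2):
    """Link keywords, skipping any that sit too close to the last linked one."""
    pieces = []
    cursor = 0          # end of the last piece of text copied to the output
    last_linked = None  # word index of the most recently linked keyword
    for index, match in enumerate(re.finditer(r"\w+", text)):
        word = match.group(0)
        if word not in links:
            continue
        # Number of words strictly between this keyword and the last linked one.
        if last_linked is not None and index - last_linked - 1 < min_gap:
            continue
        pieces.append(text[cursor:match.start()])
        pieces.append(f'<a href="{links[word]}">{word}</a>')
        cursor = match.end()
        last_linked = index
    pieces.append(text[cursor:])
    return "".join(pieces)

links = {"Google": "https://example.com/google", "Gmail": "https://example.com/gmail"}
text = ("This is my description for Google and Gmail. "
        "I need to replace Google with its link and also Gmail")
print(link_keywords(text, links))
```

With the sample description, the first Gmail is left alone because only one word separates it from the linked Google, while the later Google and Gmail are both linked.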
I am trying to build unique random phrases from text for detecting plagiarism. The idea is that an author submits an article and PHP then builds phrases from the text, which are used for plagiarism detection.
Consider the following sentence:
This is a very long and boring article and this article is plagiarized.
Based on the above text, the system will determine how many phrases to generate, e.g. a 20-word article will have 3 phrases. Each generated phrase is at least two and at most three words long. The returned output will look like this:
very long
article is plagiarized
I wrote the following code:
$words = str_word_count($text, 1);
$total_phrases_required = count($words) /2;
//build phrases
I need a hint on how to complete the rest.
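One possible way to finish this, sketched in Python rather than PHP: split the word list into as many segments as phrases are needed, then take one random 2- or 3-word phrase from each segment so the phrases never overlap. The ratio of roughly one phrase per 7 words is an assumption chosen to fit the "20 words, 3 phrases" example, and `build_phrases` is a made-up name.

```python
import random
import re

def build_phrases(text, words_per_phrase=7, seed=None):
    """Pick random, non-overlapping phrases of 2 or 3 words from the text."""
    rng = random.Random(seed)
    words = re.findall(r"[\w']+", text)
    if len(words) < 2:
        return []
    # Assumed rule: about one phrase per words_per_phrase words.
    count = max(1, round(len(words) / words_per_phrase))
    count = min(count, len(words) // 2)  # every segment needs at least 2 words
    segment = len(words) // count
    phrases = []
    for k in range(count):
        start = k * segment
        end = len(words) if k == count - 1 else start + segment
        length = min(rng.choice((2, 3)), end - start)
        offset = rng.randint(start, end - length)  # phrase stays inside its segment
        phrases.append(" ".join(words[offset:offset + length]))
    return phrases

sentence = "This is a very long and boring article and this article is plagiarized."
print(build_phrases(sentence, seed=1))
```

Passing a `seed` makes the selection repeatable, which helps when the same article has to be re-checked later.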
You could break the texts up into two arrays of sentences and then use a function like similar_text to recursively check for similar strings.
Another idea, to find outright plagiarism: break the text down into sentences again, but then put them into a database and run a query that selects the count of the index column and groups by the sentence column. If any result comes back greater than 1, you have an exact match for that sentence.
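For comparison, here is a small Python sketch of the sentence-by-sentence check. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for PHP's similar_text, so the scores will not be identical to PHP's; the sentence splitter and the 0.8 threshold are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similar_sentences(submitted, reference, threshold=0.8):
    """Return (submitted sentence, reference sentence, score) for close matches."""
    matches = []
    for a in split_sentences(submitted):
        for b in split_sentences(reference):
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                matches.append((a, b, round(score, 2)))
    return matches

submitted = "The cat sat on the mat. Something else entirely."
reference = "Unrelated opening line. The cat sat on the mat."
print(similar_sentences(submitted, reference))
```

A score of 1.0 corresponds to the exact-match case that the database query would catch; lower thresholds also flag lightly reworded sentences.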
Basically, I want to calculate the "proximity" of various terms.
By "proximity" I mean specifically the number of spaces/characters/words that sit between them.
Example:
Terms = Word1 / Word2
Chunk = "blah Word1 blah blah blah blah blah Word2 blah"
Proximity = Word1-Word2:5
The script would see the 2 terms, locate them, and then work out the distance based on the words that lie between them.
A more advanced version would be to examine the semantic structure - and identify whether the terms occur within the same semantic element, or a sibling, or a parent etc.
Thus proximity discovery of terms may be within the same paragraph, or in sequential paragraphs, or under the same "parent" (heading) but otherwise separate etc.
Further - introducing things like word stemming/relationships/soundings at a later date may be useful too.
I've looked around the net (Google, here, php forums, php script sites).
Not seeing anything like it.
I can see tools on some sites that do something similar (but limited), usually SEO-based tools.
I want to be able to apply this to "text" in general ... as I may apply it to uploaded word/txt files etc.
I'm not seeing any real examples, so I can only assume it's more than a trifle to code.
The question is - how can I do this?
How would I handle variant order of the words (Word1+Word2 / Word2+Word1)?
How could I handle identifying proximity within/outside of the same element/structure?
Hoping someone can shed some light/make some suggestions.
If you need to do a lot of this kind of search on a given text, you could begin by indexing the whole text into a database containing the word, its position in the text, and the paragraph number (if needed). Then, you could select all the Word1 and Word2 positions, and it shouldn't be too hard to infer the minimal distance.
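As a rough sketch of the indexing idea, an in-memory dictionary can stand in for the database table: each word maps to a list of (position, paragraph number) pairs, and the minimal distance falls out of a merge over the two position lists. The function names and the blank-line paragraph rule are assumptions of this example.

```python
import re
from collections import defaultdict

def build_index(text):
    """Map each lowercased word to its (word position, paragraph number) pairs."""
    index = defaultdict(list)
    position = 0
    for para_no, paragraph in enumerate(re.split(r"\n\s*\n", text)):
        for word in re.findall(r"\w+", paragraph.lower()):
            index[word].append((position, para_no))
            position += 1
    return index

def min_distance(index, word1, word2):
    """Smallest number of words between any occurrence of the two terms."""
    a = [pos for pos, _ in index.get(word1.lower(), [])]
    b = [pos for pos, _ in index.get(word2.lower(), [])]
    best = None
    i = j = 0
    # Both lists are sorted, so advance whichever pointer is behind.
    while i < len(a) and j < len(b):
        gap = abs(a[i] - b[j]) - 1
        if best is None or gap < best:
            best = gap
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return best  # None if either term is missing

index = build_index("blah Word1 blah blah blah blah blah Word2 blah")
print(min_distance(index, "Word1", "Word2"))
```

Because the distance is taken as an absolute difference, the order of the two terms does not matter, and the stored paragraph numbers can be compared when "same paragraph" matters.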
Edit:
Here is an attempt at a simple one-shot algorithm that does not use a database.
1: Remove any HTML and punctuation to keep only the words
2: Search for the first occurrence of Word1
3: Count the number of words (or chars, or spaces) until you reach the next occurrence of Word2
4: If you reach Word1 again before reaching Word2, restart the counter
5: Record the distance, then repeat steps 2-5 to get the other occurrences of Word1 and Word2
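The steps above can be sketched in Python roughly as follows. As a variation, this version records a distance whenever one term is followed by the other, in either order, which also covers the Word2-before-Word1 case from the question; the tag-stripping regex is deliberately crude and the function name is made up.

```python
import re

def proximities(text, word1, word2):
    """List the word counts between consecutive occurrences of two different terms."""
    # Step 1: drop HTML tags and punctuation, keep only lowercased words.
    words = re.findall(r"\w+", re.sub(r"<[^>]+>", " ", text).lower())
    w1, w2 = word1.lower(), word2.lower()
    distances = []
    last = None  # (term, position) of the most recent hit
    for pos, word in enumerate(words):
        if word not in (w1, w2):
            continue
        if last is not None and last[0] != word:
            # The other term was seen last: record the words in between.
            distances.append(pos - last[1] - 1)
        # Seeing the same term again simply restarts the count from here.
        last = (word, pos)
    return distances

chunk = "blah Word1 blah blah blah blah blah Word2 blah"
print(proximities(chunk, "Word1", "Word2"))
```

Taking `min()` of the returned list gives the closest pair; this sketch assumes the two terms are different words.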