String Compression Class - php

I am trying to build a string compression system that compresses strings containing frequently used words.
But I have no idea how to make the logic work.
I was thinking of replacing words that appear often with a simple <1> and putting each word in an array, so that when reading the string we can see that <1> stands for the first word in the array, or something like that.
But that is not my problem at the moment.
I'm trying to figure out how I could actually count how many times a word appears.
I can't really use explode(' ',$str); and count the pieces, since I would like to check not only words but everything: for example, if two words always appear together with a space between them, I would like to store that whole sequence in my array as well.
All of that with the idea of compressing a string.
I am not looking for code though; I am simply trying to find a good logic to make this work.
Does anyone have an idea of how I could achieve that?
Thanks for any comment/answer.

I think the only way to do this is a sliding window... Hopefully you are using small strings :)
So, let's say your string was.
"Joey Novak Needs More Reputation :)"
We start with a 10 character string and search the rest of the text for other instances of it. The first 10 character string would be "Joey Novak", so we search the remainder of the text for "Joey Novak". If we find one, awesome! We replace it with the marker (<1> works) and search again. If we don't, we move on to the next 10 character string, which would be "oey Novak ", and do the same, and so on. When we finish with all the 10 character strings, we move on to 9 characters, and work our way down. Since the marker is 3 characters long, you only need to go down to 4 character strings.
Joey
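The sliding-window idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's own code: the function names compressRepeats and expandRepeats are made up, every occurrence of a repeated window is replaced (not only the later ones), and the input is assumed to contain no "<" or ">" so markers cannot clash with real text. It uses PHP's built-in substr_count(), which also covers the "how many times does this appear" part of the question.

```php
<?php
// Hypothetical helper sketching the sliding-window idea; not a tuned compressor.
// Assumes the input contains no "<" or ">" characters.
function compressRepeats(string $text, int $maxLen = 10, int $minLen = 4): array
{
    $dictionary = [];
    for ($len = $maxLen; $len >= $minLen; $len--) {
        // Slide a window of $len characters across the current text.
        for ($i = 0; $i + $len <= strlen($text); $i++) {
            $window = substr($text, $i, $len);
            // Skip windows that overlap a marker inserted earlier.
            if (strpbrk($window, '<>') !== false) {
                continue;
            }
            // substr_count() counts non-overlapping occurrences.
            if (substr_count($text, $window) > 1) {
                $dictionary[] = $window;
                $marker = '<' . count($dictionary) . '>';
                $text = str_replace($window, $marker, $text);
            }
        }
    }
    return [$text, $dictionary];
}

// Hypothetical inverse: marker <n> maps to dictionary entry n - 1.
function expandRepeats(string $text, array $dictionary): string
{
    foreach ($dictionary as $index => $word) {
        $text = str_replace('<' . ($index + 1) . '>', $word, $text);
    }
    return $text;
}

[$packed, $words] = compressRepeats('the cat and the dog and the bird');
echo $packed, PHP_EOL;
echo expandRepeats($packed, $words), PHP_EOL;
```

The nested loops rescan the text for every window, so this is only practical for small strings, as the answer warns. Once the dictionary holds more than nine entries the markers grow to four characters, so 4 character windows stop saving anything.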

Related

Detect repetition in text string / copied text

I have an input form where users can upload a test report, minimum length is 100 words. Some users write less than this, and simply copy what they wrote until the threshold of 100 words is met.
I would like to test (ideally via PHP) whether a text string contains repeated text, i.e. whether subsets of this string have been copied.
I was thinking of running a Fourier analysis on the text, which could reveal repetitions inside the string.
Does a PHP class or regex example exist for this purpose?
Some sample text:
blabla bla. this is some text now I am getting bored. this is some
text now I am getting bored. this is some text now I am getting bored.
this is some text now I am getting bored. this is some text now I am
getting bored. some stuff in the end.
Update: My proposal to solve this is as follows
1) Map the string to an array of integers, i.e. find a numeric representation for every character. So the sample above would become
numerics = array ( 2, 5, 1, 2, 5, 1, ...);
2) Apply a Fourier transform to this array to get the "character frequency spectrum"
FT = fft (numerics);
This detects regular patterns in the character space.
e.g. one could use this class to compute the fft.
3) Detect peaks of the function FT. Measure the relative height of the peaks, compared to the noise in the background.
4) Set a threshold for the peaks. If any peak is above this threshold, then return that regular patterns in the text have emerged. e.g. the repetition of sentences several times should clearly mark a high peak at a certain frequency.
As this proposal would be quite straightforward in data analytics, I wonder whether it has already been coded. That was my purpose in asking here: does anybody know whether such an algorithm already exists as open source?
Of course, alternative solutions / proposals how to solve this problem would be appreciated.
There is no existing function or library that detects repeating strings in the way you would like. You can break the problem down into an algorithm that starts with one word, then two words, etc., but this will be a lot of work.
Your customers will then start copying non-repeating sentences, and you'll have another problem that you cannot solve.
You have to manage your testers and have options to penalize them for invalid entries.
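As a much simpler alternative to the Fourier proposal, whole-sentence copying like the sample text can be caught by counting duplicate sentences. The sketch below is an illustration only: the function name repeatedSentenceRatio is made up, sentences are split naively on ".", "!" and "?", and any cutoff you compare the ratio against is your own choice. It will not catch paraphrased or partially copied text.

```php
<?php
// Hypothetical helper: fraction of sentences that repeat an earlier sentence.
function repeatedSentenceRatio(string $text): float
{
    // Collapse line breaks and runs of whitespace into single spaces.
    $flat = preg_replace('/\s+/', ' ', $text);
    // Naive sentence split on ., ! or ? followed by optional whitespace.
    $sentences = preg_split('/[.!?]+\s*/', $flat, -1, PREG_SPLIT_NO_EMPTY);
    if (count($sentences) === 0) {
        return 0.0;
    }
    $seen = [];
    $repeats = 0;
    foreach ($sentences as $sentence) {
        // Compare case-insensitively and ignore surrounding spaces.
        $key = strtolower(trim($sentence));
        if (isset($seen[$key])) {
            $repeats++;
        }
        $seen[$key] = true;
    }
    return $repeats / count($sentences);
}

$sample = "blabla bla. this is some text now I am getting bored. this is some\n"
    . "text now I am getting bored. this is some text now I am getting bored.\n"
    . "this is some text now I am getting bored. this is some text now I am\n"
    . "getting bored. some stuff in the end.";
echo repeatedSentenceRatio($sample), PHP_EOL;
```

For the sample text, four of the seven sentences repeat an earlier one, so the ratio is a little over one half.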

Regular Expression (regex) match of base64_decode concatenated using PHP

So I've been trying to build a regex for the past couple of hours, and I'm starting to go crazy wondering whether this is even possible or worthwhile.
I have a script that scans PHP files checking MD5 sum for known malicious files, and certain strings. Most recently i've come across files where instead of using base64_decode in the PHP file, they are using variables and concatenating it so the scanner doesn't pick it up.
As an example here's the latest one I found:
$a='bas'.'e6'.'4_d'.'ecode';eval($a
So because the scanner searches for base64_decode this file wasn't picked up as they are using PHP to concatenate base64_decode in a variable, and then call the variable.
Forgive me because i've just started with regex, but is it even possible to search for something like this using regex? I mean, I understand and was able to get a regex that would match that exact one, but what about if they used this instead:
$a='b'.'ase'.'64_d'.'ecode';eval($a
It wouldn't be picked up because the regex was looking for ' then b then a, etc etc.
I've already added
(eval)\(\$[a-z]
To send me an email as a notice to check the file, i'll have to let it run for a couple days and see how many false positives show up, but my main concern is with the base64_decode
If someone could please shed some light on this for me and maybe point me in the right direction, I would greatly appreciate it.
Thanks!!
You can use this regexp:
b\W*a\W*s\W*e\W*6\W*4\W*_\W*d\W*e\W*c\W*o\W*d\W*e
It searches for base64_decode with any number of non-word characters (anything other than letters, digits, and underscore) interspersed between its characters.
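A minimal sketch of how that pattern could be wired into a scanner with preg_match. The helper name looksLikeHiddenBase64Decode is made up, and the / delimiters and the i flag are additions not present in the answer. The pattern also matches a plain base64_decode, and it can only catch the name split across string literals, not a name built by other means.

```php
<?php
// Pattern from the answer, wrapped in delimiters; the i flag is an added assumption.
$pattern = '/b\W*a\W*s\W*e\W*6\W*4\W*_\W*d\W*e\W*c\W*o\W*d\W*e/i';

// Hypothetical helper: true when the source contains a (possibly split) base64_decode.
function looksLikeHiddenBase64Decode(string $source, string $pattern): bool
{
    return preg_match($pattern, $source) === 1;
}

$samples = [
    "\$a='bas'.'e6'.'4_d'.'ecode';eval(\$a",
    "\$a='b'.'ase'.'64_d'.'ecode';eval(\$a",
    "echo 'hello world';",
];

foreach ($samples as $sample) {
    // Print each sample with a flag showing whether it should be reviewed.
    echo looksLikeHiddenBase64Decode($sample, $pattern) ? 'FLAG  ' : 'ok    ', $sample, PHP_EOL;
}
```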

Displaying words after taking data (user chosen word length) from a txt file PHP

I am a bit new to PHP and have a question that I would like to get some different ideas on. I am writing a PHP script that opens as an HTML form with an intro section, a few radio buttons, and a submit button. When the user clicks submit, I want the PHP script to remove words from a static text file that I have chosen (an ebook), depending on the word length the user chooses (the radio buttons are labeled 1, 2, 3, etc.).
This is what I have been playing with:
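<?php
$myTextFile = "ebook.txt";
// file_get_contents() reads the whole file into a string
$fileContents = file_get_contents($myTextFile);
// strtok() takes the string first, then the list of delimiter characters
// ("allremoveddelimiters" stands in for the delimiters being stripped)
$txtTok = strtok($fileContents, "allremoveddelimiters");
while ($txtTok !== false) {
    echo $txtTok;
    // later calls pass only the delimiters and continue on the same string
    $txtTok = strtok("allremoveddelimiters");
}
?>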
I have gotten my string stripped of all delimiters chosen, and it echoes out all on one line. This is where I have become stuck. What I am needing to do now is to pull words with a certain length (user chosen through radio buttons) out of the text file and then print the changed txt file to the screen. I have looked at the explode() function, but I am not sure that I really understand it. I also played with the str_replace() but it didn't seem to do what I needed it to, or I just didn't understand the complete function. Also, I was advised to not use the preg_() functions. I am assuming that I would want to take the txt file and put all the words in to an array and then pull the words of the user chosen length out and then print the updated array back as a string, but so far I haven't found an example that was understandable by my novice skills.
A push in any direction will be certainly appreciated. If any other information is needed, please advise.
Example:
The full string would contain this -
YOU don't know about me without you have read a book by the name of The
Adventures of Tom Sawyer; but that ain't no matter. That book was made
by Mr. Mark Twain, and he told the truth, mainly. There was things which
he stretched, but mainly he told the truth. That is nothing.
And if the user chooses 3 from the choices -
don't know about me without have read a book by name of Adventures of Sawyer; that ain't no matter. That book made by Mr. Mark Twain, he told truth, mainly. There things which he stretched, mainly he told truth. That is nothing.
Thinking about this quickly, I would recommend looking at regular expressions. This should easily allow you to extract words based on length. They can be quite complicated to understand though.
I would push you in the direction of preg_replace, http://php.net/manual/en/function.preg-replace.php. This will allow you (once you have your working regular expression) to extract words of the specified length, and replace them with an empty string.
I would recommend looking at a site like http://public.kvalley.com/regex/regex.asp for testing created regular expressions.
Having taken a quick look, I believe the following expression matches 4 letter words:
\b(\w{4})\b
Replace the 4 in the expression with the length of the word you want to match. To break down the regex simply: \b stands for a word boundary (e.g. next to a space or full stop), \w stands for a word character (a letter, digit, or underscore), and {4} means match the previous item exactly 4 times (so a 4 letter word). I believe the following will work.
$myReplacedString = preg_replace('/\b(\w{4})\b/', '', $myString);
$myTrimmedString = trim(preg_replace('/\s\s+/', ' ', $myReplacedString));
where $myString is your string you've fetched from your file.
I hope this explains it well.
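Since the question mentions advice to avoid the preg_ functions, here is a rough sketch of the same idea without regular expressions. The helper name removeWordsOfLength and the list of punctuation characters are assumptions for illustration. One difference from the \b approach: an apostrophe counts as a word boundary for \b, so with a length of 3 the regex would cut "don" out of "don't", while this version measures "don't" as one 5 character word.

```php
<?php
// Hypothetical helper: drop every word whose length equals $length.
function removeWordsOfLength(string $text, int $length): string
{
    // Treat line breaks and tabs as spaces so explode() sees one separator.
    $flat = str_replace(["\r", "\n", "\t"], ' ', $text);
    $kept = [];
    foreach (explode(' ', $flat) as $token) {
        if ($token === '') {
            continue; // gap left by consecutive separators
        }
        // Measure the word without surrounding punctuation, so "the," counts as 3.
        $core = trim($token, ".,;:!?\"'()");
        if (strlen($core) !== $length) {
            $kept[] = $token;
        }
    }
    return implode(' ', $kept);
}

echo removeWordsOfLength("YOU don't know about me without you have read a book", 3), PHP_EOL;
```

In the real form handler, the length would come from the submitted radio button value, cast to an integer before use.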

Best way to parse a text document

I'm trying to parse a plain text document in PHP but have no idea how to do it correctly.
I want to separate each word, assign them an ID and save the result in JSON format.
Sample text:
"Hello, how are you (today)"
This is what im doing at the moment:
$document_array = explode(' ', $document_text);
json_encode($document_array);
The resulting JSON is
[["Hello,"],["how"],["are"],["you"],["(today)"]]
How do I ensure that spaces are kept in-place and that symbols are not included along with the words...
[["Hello"],[", "],["how"],[" "],["are"],[" "],["you"],[" ("],["today"],[")"]]
I’m sure some sort of regex is required... but have no idea what kind of pattern to apply to deal with all cases... Any suggestions guys?
This is actually a really complex problem, and one that's subject to a fair amount of academic research. It sounds so simple (just split on whitespace, with maybe a few rules for punctuation...) but you quickly run into issues. Is "didn't" one word or two? What about hyphenated words? Some might be one word, some might be two. What about multiple successive punctuation characters? Possessives versus quotes? And so on. Even determining the end of a sentence is non-trivial. (It's just a full stop, right?!)
This problem is one of tokenisation and a topic that search engines take very seriously. To be honest you should really look at finding a tokeniser in your language of choice.
Maybe this?
array_filter(preg_split('/\b/', $document_text))
array_filter removes the empty strings at the first and/or last index of the resulting array, which appear if your string starts or ends with a word boundary (for \b see: http://php.net/manual/en/regexp.reference.escape.php). Be aware that array_filter without a callback also drops any element that evaluates to false, such as the string "0".
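A small variation on the same split, shown here only as a sketch: passing the PREG_SPLIT_NO_EMPTY flag to preg_split drops the empty pieces directly. The result stays sequentially indexed from 0, so json_encode emits a JSON array. With array_filter, removing index 0 leaves a gap in the keys, and json_encode would then emit an object instead.

```php
<?php
$document_text = 'Hello, how are you (today)';

// Split at every word boundary and discard empty pieces.
$tokens = preg_split('/\b/', $document_text, -1, PREG_SPLIT_NO_EMPTY);

echo json_encode($tokens), PHP_EOL;
// prints ["Hello",", ","how"," ","are"," ","you"," (","today",")"]
```

This gives a flat list of tokens; the array index of each token can serve as its ID.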

Regex to check if exact string exists

I am looking for a way to check if an exact string match exists in another string using regex or any better method suggested. I understand that you can tell regex to match a space or any other non-word character at the beginning or end of a string. However, I don't know exactly how to set it up.
Search String: t
String 1: Hello World, Nice to see you! t
String 2: Hello World, Nice to see you!
String 3: T Hello World, Nice to see you!
I would like to use the search string and compare it to String 1, String 2 and String 3 and only get a positive match from String 1 and String 3 but not from String 2.
Requirements:
Search String may be at any character position in the Subject.
There may or may not be a white-space character before or after it.
I do not want it to match if it is part of another string; such as part of a word.
For the sake of this question:
I think I would do this using this pattern: /\bt\b/gi
/\b{$search_string}\b/gi
Does this look right? Can it be made better? Any situations where this pattern wouldn't work?
Additional info: this will be used in PHP 5
Your suggestion of /\bt\b/gi is essentially the way to go. You've correctly used \b for word boundaries, and the case-insensitive modifier matches both upper and lower case. One caveat for PHP: the preg_ functions have no g modifier, so drop it and use preg_match_all() if you need every match. Simple, straightforward, clean. Look no further than what you've already come up with.
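A minimal sketch of the pattern in PHP, run against the three test strings from the question. The helper name containsExactWord is made up. The pattern is written with only the i modifier, and preg_quote() is added so that a search string containing regex metacharacters is treated literally. Note that \b only behaves as expected when the search string itself starts and ends with a word character.

```php
<?php
// Hypothetical helper: true when $needle occurs in $haystack as a whole word.
function containsExactWord(string $needle, string $haystack): bool
{
    // preg_quote() escapes regex metacharacters and the / delimiter.
    $pattern = '/\b' . preg_quote($needle, '/') . '\b/i';
    return preg_match($pattern, $haystack) === 1;
}

var_dump(containsExactWord('t', 'Hello World, Nice to see you! t')); // String 1
var_dump(containsExactWord('t', 'Hello World, Nice to see you!'));   // String 2
var_dump(containsExactWord('t', 'T Hello World, Nice to see you!')); // String 3
```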
Looks fine to me. You might want to check the exact meaning of the \b assertion to make sure it's exactly what you need.
Can't really name any situation where this pattern "wouldn't work" without a more elaborate description, but \b would work fine for your testcases.
As the old saying goes, give a man a regular expression and he is happy for a day; teach him to write regular expressions and he is happy for a lifetime (or something to that effect). Try out the "regulator".
It provides a GUI and some pretty good examples for regular expression needs.
