I have some HTML/CSS/JavaScript with painfully long class, id, variable and function names and other, combined strings that get used over and over. I could probably rename or restructure a few of them and cut the text in half.
So I'm looking for a simple algorithm that reports the longest repeated strings in a text. Ideally, it would sort in descending order of length times number of occurrences, so as to highlight the strings that, if renamed globally, would yield the most savings.
This feels like something I could do painfully in 100 lines of code, for which there's some elegant, 10-line recursive regex. It also sounds like a homework problem, but I assure you it's not.
I work in PHP, but would enjoy seeing something in any language.
NOTE: I'm not looking for HTML/CSS/JavaScript minification per se. I like meaningful text, so I want to do it by hand, and weigh legibility against bloat.
This will find all repeated strings:
(?=((.+)(?:.*?\2)+))
Use that with preg_match_all and select the longest one.
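// Sort descending by the score stored in element 0.
function len_cmp($match1, $match2) {
    return $match2[0] - $match1[0];
}
preg_match_all('/(?=((.+)(?:.*?\2)+))/s', $text, $matches, PREG_SET_ORDER);
// Iterate by reference so the score is written back into $matches.
foreach ($matches as &$match) {
    // Score = number of occurrences * length of the repeated string.
    $match[0] = substr_count($match[1], $match[2]) * strlen($match[2]);
}
unset($match); // break the reference before $match is reused below
usort($matches, "len_cmp");
foreach ($matches as $match) {
    echo "($match[2]) $match[1]\n";
}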
This method could be quite slow though, as there could be a LOT of strings repeating. You could reduce it somewhat by specifying a minimum length, and a minimum number of repetitions in the pattern.
(?=((.{3,})(?:.*?\2){2,}))
This requires the repeated string to be at least three characters long and to occur at least three times (the first occurrence plus two more).
Edit: Changed to allow characters between the repetitions.
Edit: Changed sorting order to reflect best match.
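If the lookahead regex turns out too slow, a cheaper variation on the same "length times instances" ranking is to count identifier-like tokens instead of arbitrary substrings. This is only a sketch under my own assumptions: the function name rank_repeated_tokens(), the token pattern and the two thresholds are illustrative, not part of the answer above, and it will not find repeats that span several tokens.

```php
<?php
// Rank repeated identifier-like tokens by the total bytes they occupy.
// $minLength and $minCount are illustrative thresholds (assumptions).
function rank_repeated_tokens(string $text, int $minLength = 4, int $minCount = 2): array
{
    // Identifier-like runs: a letter or underscore, then letters, digits, "_" or "-".
    preg_match_all('/[A-Za-z_][A-Za-z0-9_-]*/', $text, $m);

    $scores = [];
    foreach (array_count_values($m[0]) as $token => $count) {
        if (strlen($token) >= $minLength && $count >= $minCount) {
            // Score = length * occurrences, i.e. the bytes a rename could affect.
            $scores[$token] = strlen($token) * $count;
        }
    }
    arsort($scores); // highest score first
    return $scores;
}

$css = '.navigation-container { } .navigation-container a { } #x .navigation-container { }';
print_r(rank_repeated_tokens($css));
```

The result is an array of token => score, so the first few keys are the candidates worth renaming by hand.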
Seems I'm a little late, but it also does the work:
preg_match_all('/(id|class)="([a-zA-Z0-9_ -]+)"/', $html, $matches);
$result = explode(" ", implode(" ", $matches[2]));
$parsed = array();
foreach($result as $string) {
if(isset($parsed[$string])) {
$parsed[$string]++;
} else {
$parsed[$string] = 1;
}
}
arsort($parsed);
foreach($parsed as $k => $v) {
echo $k . " -> Found " . $v . " times<br/>";
}
The output will be something like:
some_id -> Found 2 times
some_class -> Found 2 times
So I have this string and I want to output all of the words that end in "ly".
$desi = "DESIDERATA
Go placidly amid the noise and the haste,
and remember what peace there may be in silence.
As far as possible, without surrender,
be on good terms with all persons.
Speak your truth quietly and clearly;
and listen to others,
even to the dull and the ignorant;
they too have their story.
Avoid loud and aggressive persons;
they are vexatious to the spirit.
If you compare yourself with others,
you may become vain or bitter,
for always there will be greater and lesser persons than yourself.
Enjoy your achievements as well as your plans.
Keep interested in your own career, however humble;
it is a real possession in the changing fortunes of time.
Exercise caution in your business affairs,
for the world is full of trickery.
But let this not blind you to what virtue there is;
many persons strive for high ideals,
and everywhere life is full of heroism.
Be yourself. Especially do not feign affection.
Neither be cynical about love,
for in the face of all aridity and disenchantment,
it is as perennial as the grass.
Take kindly the counsel of the years,
gracefully surrendering the things of youth.
Nurture strength of spirit to shield you in sudden misfortune. But do
not distress yourself with dark imaginings. Many fears are born of
fatigue and loneliness.
Beyond a wholesome discipline, be gentle with yourself.
You are a child of the universe no less than the trees and the
stars; you have a right to be here.
And whether or not it is clear to you,
no doubt the universe is unfolding as it should.
Therefore be at peace with God, whatever you conceive Him to be. And
whatever your labors and aspirations,
in the noisy confusion of life, keep peace in your soul.
With all its sham, drudgery, and broken dreams,
it is still a beautiful world.
Be cheerful. Strive to be happy.";
How can I get all the words that end in "ly" in this string?
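// \b word boundaries match without consuming the surrounding characters,
// so words at the very start or end of the string are found too.
preg_match_all('/\b([a-zA-Z]+ly)\b/', $desi, $matches);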
$matches = $matches[1];
var_dump($matches);
This is a simple solution I've come up with using a regular expression. It captures the words ending in "ly" and stores them in the array passed as the third parameter of preg_match_all.
Hope it helps!
Here is one solution. First I made an array containing all words:
$words = explode(" ", $desi);
Then test each word and check if it ends with "ly" (using preg_match function)
foreach($words as $word){
if (preg_match("/ly$/", $word)){
echo $word." ";
}
}
I printed out the words, but you can do whatever you like with them, such as saving them in an array.
NOTE: this code fails to select "clearly" because it is followed by ";" (and, since the text is split on spaces only, by a line break).
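One way around the punctuation problem noted above is to split on every run of non-letters rather than on spaces. A minimal sketch, assuming plain ASCII text; the function name words_ending_with() is made up for the illustration:

```php
<?php
// Return every word in $text that ends with $suffix (case-sensitive).
function words_ending_with(string $text, string $suffix): array
{
    // Any run of non-letters (spaces, newlines, commas, semicolons...) is a separator.
    $words = preg_split('/[^A-Za-z]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $found = [];
    foreach ($words as $word) {
        // The word must be longer than the suffix, so a bare "ly" is not reported.
        if (strlen($word) > strlen($suffix)
            && substr($word, -strlen($suffix)) === $suffix) {
            $found[] = $word;
        }
    }
    return $found;
}

$sample = "Go placidly amid the noise and the haste,\nSpeak your truth quietly and clearly;";
print_r(words_ending_with($sample, 'ly'));
```

Because the separators are thrown away during the split, "clearly;" arrives in the loop as "clearly".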
This may help you
$desi = "DESIDERATA
Go placidly amid the noise and the haste,
and remember what peace there may be in silence.
As far as possible, without surrender,
be on good terms with all persons.
Speak your truth quietly and clearly;";
$words_array = explode(' ',$desi);
print_r($words_array);
$needle = 'ly';
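$result = find_words_ending_with($needle, $words_array);
echo "<p>=============</p> ";
print_r($result);
// Not called array_find(): PHP 8.4 ships a built-in function with that name.
function find_words_ending_with($needle, array $haystack)
{
    foreach ($haystack as $key => $value) {
        // Drop every word that does not end with $needle.
        if (substr($value, -strlen($needle)) != $needle)
        {
            unset($haystack[$key]);
        }
    }
    return $haystack;
}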
Try this solution
$needle = 'ly';
$all = explode(' ', $desi);
$selected = [];
foreach ($all as $key => $value) {
if (substr($value, -strlen($needle)) === (string) $needle) {
$selected[] = $value;
}
}
var_dump($selected);
I have a Paragraph that I have to parse for different keywords. For example, Paragraph:
"I want to make a change in the world. Want to make it a better place to live. Peace, Love and Harmony. It is all life is all about. We can make our world a good place to live"
And my keywords are
"world", "earth", "place"
I should report whenever I have a match and how many times.
Output should be:
"world" 2 times and "place" 1 time
Currently, I am just converting the paragraph string to an array of characters and then matching each keyword against the whole array,
which wastes resources.
Please point me to a more efficient way. (I am using PHP.)
As #CasimiretHippolyte commented, regex is the better tool here, since word boundaries can be used. Caseless matching is also possible with the i flag. Use the return value of preg_match_all:
Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.
The pattern for matching one word is: /\bword\b/i. Build an array whose keys are the search terms from $words and whose values are the corresponding counts returned by preg_match_all:
$words = array("earth", "world", "place", "foo");
$str = "at Earth Hour the world-lights go out and make every place on the world dark";
$res = array_combine($words, array_map( function($w) USE (&$str) { return
preg_match_all('/\b'.preg_quote($w,'/').'\b/i', $str); }, $words));
print_r($res);
This outputs (tested at eval.in):
Array
(
[earth] => 1
[world] => 2
[place] => 1
[foo] => 0
)
I used preg_quote to escape the words, which is not necessary if you know they don't contain any special characters. The inline anonymous function passed to array_map requires PHP 5.3 or later.
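If the keyword list grows, running one regex per keyword means scanning the paragraph once per keyword. A possible alternative, sketched here under my own assumptions (the function name count_keywords() is made up), is to tally every word once and then look the keywords up. Note that str_word_count treats hyphens and apostrophes as part of a word, so "world-lights" would count as one word here, unlike with the \b pattern above.

```php
<?php
// Count whole-word, case-insensitive occurrences of each keyword in one pass.
function count_keywords(string $text, array $keywords): array
{
    // str_word_count(..., 1) returns the words as an array;
    // array_count_values() turns that into word => occurrences.
    $tally = array_count_values(str_word_count(strtolower($text), 1));

    $result = [];
    foreach ($keywords as $keyword) {
        $result[$keyword] = $tally[strtolower($keyword)] ?? 0;
    }
    return $result;
}

$para = "I want to make a change in the world. We can make our world a good place to live";
print_r(count_keywords($para, ['world', 'earth', 'place']));
```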
<?php
Function woohoo($terms, $para) {
$result ="";
foreach ($terms as $keyword) {
$cnt = substr_count($para, $keyword);
if ($cnt) {
$result .= $keyword. " found ".$cnt." times<br>";
}
}
return $result;
}
$terms = array('world', 'earth', 'place');
$para = "I want to make a change in the world. Want to make it a better place to live.";
$r = woohoo($terms, $para);
echo($r);
?>
I would use preg_match_all(). Here is how it would look in your code. The function returns the number of matches found, and the $matches array holds the matches themselves:
<?php
$string = "world";
$paragraph = "I want to make a change in the world. Want to make it a better place to live. Peace, Love and Harmony. It is all life is all about. We can make our world a good place to live";
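// The pattern needs delimiters; \b keeps "world" from matching inside longer words.
// $matches is already a by-reference parameter (call-time &$matches is an error since PHP 5.4).
if (preg_match_all('/\b' . preg_quote($string, '/') . '\b/', $paragraph, $matches)) {
    echo $string . ' ' . count($matches[0]) . " times";
} else {
    echo "match NOT found";
}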
?>
Consider the following string
$input = "string with {LABELS} between brackets {HERE} and {HERE}";
I want to temporarily remove all labels (= whatever is between curly braces) so that an operation can be performed on the rest of the string:
$string = "string with between brackets and";
For argument's sake, the operation is: prefix every word that starts with 'b' with the word 'yes'.
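function operate($string) {
    $output = array();
    $words = explode(' ', $string);
    foreach ($words as $word) {
        // substr() takes the first character; strpos() would search for a needle instead.
        $output[] = (substr($word, 0, 1) == 'b') ? "yes$word" : $word;
    }
    return implode(' ', $output);
}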
The output of this function would be
"string with yesbetween yesbrackets and"
Now I want to insert the temporarily deleted labels back into place:
"string with {LABELS} yesbetween yesbrackets {HERE} and {HERE}"
My question is: how can I accomplish this? Important: I am not able to alter operate(), so the solution should contain a wrapper function around operate() or something. I have been thinking about this for quite a while now, but am confused as to how to do this. Could you help me out?
Edit: it would be too much to put the actual operate() in this post. It will not really add value (except make the post longer). There is not much difference between the output of operate() here and the real one. I will be able to translate any ideas from here, to the real-world situation :-)
The answer to this depends on whether or not you are able to understand operate(), even if you can't change it.
If you have absolutely no insight into operate(), your problem is simply unsolvable: To reinsert your labels you need one of
Their offset or relative position (You can't know them, if you don't know operate())
A marker for their place (You can't have them, if you don't know how operate() will work on them)
If you have at least some insight into operate(), this becomes something between solvable and easy:
If operate($a . $b)==operate($a) . operate($b), then you just split your original input by the labels, run the non-label parts through operate(), but obviously not the labels, then reassemble
If operate() is guaranteed to leave a placeholder string alone, and that placeholder is itself guaranteed not to occur in the normal input ("\0" and friends come to mind), then you extract your labels in order, replace each of them with the placeholder, run the result through operate(), and afterwards replace the placeholders with your saved labels (in order)
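The first case (operate() distributing over concatenation) could be sketched roughly like this. The operate() below is only a stand-in modelled on the one in the question so that the example runs on its own, and operate_around_labels() is a made-up name for the wrapper; with the real operate() you would drop the stand-in.

```php
<?php
// Stand-in for the real operate(), modelled on the question (assumption).
function operate($string) {
    $output = array();
    foreach (explode(' ', $string) as $word) {
        $output[] = (substr($word, 0, 1) == 'b') ? "yes$word" : $word;
    }
    return implode(' ', $output);
}

// Hypothetical wrapper: run operate() on the text between labels only.
function operate_around_labels($input) {
    // With PREG_SPLIT_DELIM_CAPTURE the {labels} land on odd indexes
    // and the text between them on even indexes.
    $parts = preg_split('/(\{.*?\})/', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = operate($part);
        }
    }
    // The labels were never touched, so gluing the parts back restores them in place.
    return implode('', $parts);
}

echo operate_around_labels("string with {LABELS} between brackets {HERE} and {HERE}");
```

This only gives the right result if operate() treats each piece independently of its neighbours, which is exactly the condition stated above.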
Edit
After reading your comments, here are some lines of code
$input = "string with {LABELS} between brackets {HERE} and {HERE}";
//Extract labels and replace with \0
$tmp=preg_split('/(\{.*?\})/',$input,-1,PREG_SPLIT_DELIM_CAPTURE);
$labels=array();
$txt=array();
$islabel=false;
foreach ($tmp as $t) {
if ($islabel) $labels[]=$t;
else $txt[]=$t;
$islabel=!$islabel;
}
$txt=implode("\0",$txt);
//Run through operate()
$txt=operate($txt);
//Reassemble
$txt=explode("\0",$txt);
$result='';
foreach ($txt as $t)
$result.=$t.array_shift($labels);
echo $result;
Here's what I would do as a first attempt. Split your string into single words, then feed them into operate() one by one, depending on whether the word is 'braced' or not.
$input = "string with {LABELS} between brackets {HERE} and {HERE}";
$inputArray = explode(' ',$input);
foreach($inputArray as $key => $value) {
if(!preg_match('/^{.*}$/',$value)) {
$inputArray[$key] = operate($value);
}
}
$output = implode(' ',$inputArray);
I'm setting up a Twitter-style "trending topics" box for my forum. I've got the most popular /words/, but can't even begin to think how I will get popular phrases, like Twitter does.
As it stands I just get all the content of the last 200 posts into a string and split them into words, then sort by which words are used the most. How can I turn this from most popular words into the most popular phrases?
One technique you might consider is the use of ZSETs in Redis for something like this. If you've got very large sets of data, you'll find that you can do something like this:
$words = explode(" ", $input); // Pseudo-code for breaking a block of data into individual words.
$word_count = count($words);
$r = new Redis(); // Owlient's PHPRedis PECL extension
$r->connect("127.0.0.1", 6379);
function process_phrase($phrase) {
global $r;
$phrase = implode(" ", $phrase);
$r->zIncrBy("trending_phrases", 1, $phrase);
}
for($i=0;$i<$word_count;$i++)
    for($j=1;$j<=$word_count - $i;$j++) // <= so that phrases reaching the last word are counted too
        process_phrase(array_slice($words, $i, $j));
To retrieve the top phrases, you'd use this:
// Assume $r is instantiated like it is above
$trending_phrases = $r->zReverseRange("trending_phrases", 0, 9); // range indexes are inclusive: 0..9 is ten entries
$trending_phrases will be an array of the top ten trending phrases. To do things like recent trending phrases (as opposed to a persistent, global set of phrases), duplicate all of the Redis interactions above. For each interaction, use a key that's indicative of, say, today's timestamp and tomorrow's timestamp (i.e.: days since Jan 1, 1970). When retrieving the results with $trending_phrases, just retrieve both today and tomorrow's (or yesterday's) key and use array_merge and array_unique to find the union.
Hope this helps!
I'm not sure what type of answer you were looking for, but Laconica:
http://status.net/?source=laconica
is an open source Twitter clone (a much simpler version).
Maybe you could use part of its code to build your own popular phrases?
Good luck!
Instead of splitting into individual words, split into individual phrases; it's as simple as that.
$popular = array();
foreach ($tweets as $tweet)
{
// split by common punctuation chars
$sentences = preg_split('~[.!?]+~', $tweet, -1, PREG_SPLIT_NO_EMPTY);
foreach ($sentences as $sentence)
{
$sentence = strtolower(trim($sentence)); // normalize sentences
if (isset($popular[$sentence]) === false)
//if (array_key_exists($sentence, $popular) === false)
{
$popular[$sentence] = 0;
}
$popular[$sentence]++;
}
}
arsort($popular);
echo '<pre>';
print_r($popular);
echo '</pre>';
It'll be a lot slower if you consider a phrase as an aggregation of n consecutive words.
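For reference, counting phrases as runs of n consecutive words could look roughly like the sketch below. The function name count_phrases() and the 2 to 3 word window are my own assumptions; the window is what keeps the cost manageable, since every extra phrase length adds another pass over the words. Phrases are allowed to cross sentence boundaries here, which may or may not be what you want.

```php
<?php
// Count every run of $minWords..$maxWords consecutive words across all posts.
function count_phrases(array $posts, int $minWords = 2, int $maxWords = 3): array
{
    $counts = [];
    foreach ($posts as $post) {
        // Normalise case, then split on anything that is not a letter, digit or apostrophe.
        $words = preg_split("/[^a-z0-9']+/", strtolower($post), -1, PREG_SPLIT_NO_EMPTY);
        $total = count($words);

        for ($n = $minWords; $n <= $maxWords; $n++) {
            // Slide a window of $n words over the post.
            for ($i = 0; $i + $n <= $total; $i++) {
                $phrase = implode(' ', array_slice($words, $i, $n));
                $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // most frequent phrases first
    return $counts;
}

$posts = ['Star Wars is great', 'I love Star Wars', 'star wars!'];
print_r(array_slice(count_phrases($posts), 0, 5, true));
```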
I'm looking for either a routine or a technique for error-tolerant string comparison.
Let's say we have the test string Čakánka (yes, it contains CE characters).
Now, I want to accept any of the following strings as OK:
cakanka
cákanká
ČaKaNKA
CAKANKA
CAAKNKA
CKAANKA
cakakNa
The problem is that I often swap letters within a word, and I want to minimize the user's frustration at not being able to type a word correctly (e.g. when in a rush).
So, I know how to make a case-insensitive comparison (just make it lowercase :]) and I can strip the CE characters; I just can't wrap my head around tolerating a few swapped characters.
Also, a character often ends up not just one place off (character=>cahracter) but shifted by several places (character=>carahcter), just because one finger was lazy while typing.
Thank you :]
Not sure (especially about the accents / special characters stuff, which you might have to deal with first), but for characters that are in the wrong place or missing, the levenshtein function, that calculates Levenshtein distance between two strings, might help you (quoting) :
int levenshtein ( string $str1 , string $str2 )
int levenshtein ( string $str1 , string $str2 , int $cost_ins , int $cost_rep , int $cost_del )
The Levenshtein distance is defined as
the minimal number of characters you
have to replace, insert or delete to
transform str1 into str2
Other possibly useful functions could be soundex, similar_text, or metaphone.
And some of the user notes on the manual pages of those functions, especially the manual page of levenshtein might bring you some useful stuff too ;-)
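To make that concrete, here is a rough sketch of how levenshtein could be combined with a normalisation step. Everything named here is an assumption for illustration: the tiny diacritics map stands in for proper transliteration, and fuzzy_equal() with its threshold of 2 is just one possible policy. Plain Levenshtein counts an adjacent swap as two edits, so the sketch also accepts inputs that use exactly the same letters in a different order, which covers characters shifted by several places.

```php
<?php
// Lower-case the word and strip the diacritics used in the example.
// The map is a deliberately tiny, hypothetical stand-in for real transliteration.
function normalise_word(string $s): string
{
    $map = ['Č' => 'c', 'č' => 'c', 'Á' => 'a', 'á' => 'a'];
    return strtolower(strtr($s, $map));
}

// Accept $typed if it is "close enough" to $expected.
function fuzzy_equal(string $typed, string $expected, int $maxDistance = 2): bool
{
    $a = normalise_word($typed);
    $b = normalise_word($expected);

    // Same letters in any order: tolerates characters shifted by several places.
    $sortedA = str_split($a);
    $sortedB = str_split($b);
    sort($sortedA);
    sort($sortedB);
    if ($sortedA === $sortedB) {
        return true;
    }

    // Otherwise allow a few inserted, deleted or replaced characters.
    return levenshtein($a, $b) <= $maxDistance;
}

foreach (['cakanka', 'cákanká', 'ČaKaNKA', 'CAAKNKA', 'CKAANKA', 'cakakNa', 'banana'] as $word) {
    echo $word, ' => ', fuzzy_equal($word, 'Čakánka') ? 'accepted' : 'rejected', "\n";
}
```

The same-letters check is quite permissive (any anagram passes), so for short words you may prefer to drop it and rely on the distance threshold alone.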
You could transliterate the words to Latin characters and use a phonetic algorithm like Soundex to get the essence of your word and compare it to the ones you have. In your case that would be C252 for most of your words; the two heaviest transpositions come out differently (CKAANKA gives C520 and cakakNa gives C225), so they would not match on the Soundex code alone.
Edit: The problem with comparative functions like levenshtein or similar_text is that you need to call them for each pair of input value and possible matching value. That means if you have a database with 1 million entries you will need to call these functions 1 million times.
But functions like soundex or metaphone, that calculate some kind of digest, can help to reduce the number of actual comparisons. If you store the soundex or metaphone value for each known word in your database, you can reduce the number of possible matches very quickly. Later, when the set of possible matching value is reduced, then you can use the comparative functions to get the best match.
Here’s an example:
// building the index that represents your database
$knownWords = array('Čakánka', 'Cakaka');
$index = array();
foreach ($knownWords as $key => $word) {
$code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
if (!isset($index[$code])) {
$index[$code] = array();
}
$index[$code][] = $key;
}
// test words
$testWords = array('cakanka', 'cákanká', 'ČaKaNKA', 'CAKANKA', 'CAAKNKA', 'CKAANKA', 'cakakNa');
echo '<ul>';
foreach ($testWords as $word) {
$code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
if (isset($index[$code])) {
echo '<li> '.$word.' is similar to: ';
$matches = array();
foreach ($index[$code] as $key) {
similar_text(strtolower($word), strtolower($knownWords[$key]), $percentage);
$matches[$knownWords[$key]] = $percentage;
}
arsort($matches);
echo '<ul>';
foreach ($matches as $match => $percentage) {
echo '<li>'.$match.' ('.$percentage.'%)</li>';
}
echo '</ul></li>';
} else {
echo '<li>no match found for '.$word.'</li>';
}
}
echo '</ul>';
Spelling checkers do something like fuzzy string comparison. Perhaps you can adapt an algorithm based on that reference. Or grab the spell checker guessing code from an open source project like Firefox.