Evaluate if string is not in English: best and easiest practices? - php
I have a fairly long string (5000+ chars), and I need to check whether it is in English.
After a brief web search I found several solutions:
use PEAR Text_LanguageDetect (it looks attractive, but I avoid solutions whose inner workings I don't understand)
check letter frequencies (I wrote the function below, with some comments)
check the string for national characters (like č, ß and so on)
check the string for marker words like 'is', 'the' and so on
So the function is the following:
function is_english($str){
    // Most used English chars and their expected frequencies (percent)
    $chars = array(
        array('e',12.702),
        array('t', 9.056),
        array('a', 8.167),
        array('o', 7.507),
        array('i', 6.966),
        array('n', 6.749),
        array('s', 6.327),
        array('h', 6.094),
        array('r', 5.987),
    );
    $str = strtolower($str);
    $sum = 0;
    foreach($chars as $key=>$char){
        $i = substr_count($str,$char[0]);
        $i = 100*$i/strlen($str); // Normalization: percentage of the whole string
        $i = $i/$char[1]; // Ratio of observed to expected frequency
        $chars[$key][2] = $i; // Store the ratio; it is read back as $char[2] below
        $sum += $i;
    }
    $avg = $sum/count($chars);
    // Calculation of mean square value
    $value = 0;
    foreach($chars as $char)
        $value += pow($char[2]-$avg,2);
    // Average value
    $value = $value / count($chars);
    return $value;
}
In general, this function measures the character frequencies and compares them with the given pattern. The closer the frequencies are to the pattern, the closer the result should be to 0.
Unfortunately it does not work very well: for the most part, results of 0.05 and lower are English and higher ones are not. But many English strings get high values, and many foreign ones (in my case mostly German) get low values.
I can't implement the third solution yet, as I wasn't able to find any comprehensive character set to serve as foreign-language markers.
The fourth looks attractive, but I cannot figure out which markers are best to use.
Any thoughts?
PS After some discussion, Zod proposed that this question is a duplicate of the question Regular expression to match non-English characters?, which answers it only in part. So I'd like to keep this question independent.
I think the fourth solution might be your best bet, but I would expand it to include a wider dictionary.
You can find some comprehensive lists at: https://en.wikipedia.org/wiki/Most_common_words_in_English
With your current implementation, you will suffer some setbacks because many languages use the standard Latin alphabet. Even languages that go beyond the standard Latin alphabet typically use primarily "English-compliant characters," so to speak. For example, the sentence "Ich bin lustig" is German, but uses only Latin alphabetic characters. Likewise, "Jeg er glad" is Danish, but uses only Latin alphabetic characters. Of course, in a string of 5000+ characters, you will probably see some characters outside the basic Latin alphabet, but that is not guaranteed. Additionally, by focusing solely on character frequency, you might find that foreign languages which use the Latin alphabet have similar character occurrence frequencies, rendering your existing solution ineffective.
By using an English dictionary to find occurrences of English words, you can scan a string, count how many of its words are English, and from that calculate the share of English words. (A higher percentage indicates the text is probably English.)
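To illustrate why national characters alone are a weak signal, here is a minimal sketch that measures the share of non-ASCII characters in a string. The helper name non_ascii_ratio is made up for this example, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: share of characters that fall outside the ASCII range.
// Assumes $text is valid UTF-8.
function non_ascii_ratio(string $text): float
{
    // With the /u modifier the pattern is matched per code point, not per byte.
    $total = preg_match_all('/./us', $text);
    if (!$total) {
        return 0.0; // empty (or invalid UTF-8) input
    }
    $nonAscii = preg_match_all('/[^\x00-\x7F]/u', $text);
    return $nonAscii / $total;
}

// Both sentences are not English, yet contain no non-ASCII characters at all.
var_dump(non_ascii_ratio('Ich bin lustig'));
var_dump(non_ascii_ratio('Jeg er glad'));
// "Straße", written with a code point escape so the file encoding does not matter.
var_dump(non_ascii_ratio("Stra\u{00DF}e"));
```

A ratio of zero therefore says little: it is consistent with English, but also with plenty of German or Danish text.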
The following is a potential solution:
<?php
$testString = "Some long string of text that you would like to test.";
// Words from: https://en.wikipedia.org/wiki/Most_common_words_in_English
$common_english_words = array('time', 'person', 'year', 'way', 'day', 'thing', 'man', 'world', 'life', 'hand', 'part', 'child', 'eye', 'woman', 'place', 'work', 'week', 'case', 'point', 'government', 'company', 'number', 'group', 'problem', 'fact', 'be', 'have', 'do', 'say', 'get', 'make', 'go', 'know', 'take', 'see', 'come', 'think', 'look', 'want', 'give', 'use', 'find', 'tell', 'ask', 'seem', 'feel', 'try', 'leave', 'call', 'good', 'new', 'first', 'last', 'long', 'great', 'little', 'own', 'other', 'old', 'right', 'big', 'high', 'different', 'small', 'large', 'next', 'early', 'young', 'important', 'few', 'public', 'bad', 'same', 'able', 'to', 'of', 'in', 'for', 'on', 'with', 'at', 'by', 'from', 'up', 'about', 'into', 'over', 'after', 'beneath', 'under', 'above', 'the', 'and', 'a', 'that', 'i', 'it', 'not', 'he', 'as', 'you', 'this', 'but', 'his', 'they', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all', 'would', 'there', 'their', 'I', 'we', 'what', 'so', 'out', 'if', 'who', 'which', 'me', 'when', 'can', 'like', 'no', 'just', 'him', 'people', 'your', 'some', 'could', 'them', 'than', 'then', 'now', 'only', 'its', 'also', 'back', 'two', 'how', 'our', 'well', 'even', 'because', 'any', 'these', 'most', 'us');
/* you might also consider replacing "'s" with ' ', because 's is common in English
as a contraction and simply removing the single quote could throw off the frequency. */
$transformedTest = trim(preg_replace('#\s+#', ' ', preg_replace("#[^a-zA-Z'\s]#", ' ', strtolower($testString))));
$splitTest = explode(' ', $transformedTest);
$wordCount = count($splitTest);
$matchCount = 0;
foreach($splitTest as $word){
    if(in_array($word, $common_english_words))
        $matchCount++;
}
// Share of the words in the test string that appear in the common-word list
$ratio = $matchCount/$wordCount;
echo "raw count: $matchCount\n<br>\nPercent: " . $ratio*100 . "%\n<br>\n";
if($ratio > 0.5){
    echo "More than half of the test string is English. Text is likely English.";
}else{
    echo "Text is likely a foreign language.";
}
?>
You can see an example here which includes two sample strings to test (one which is German, and one which is English): https://ideone.com/lfYcs2
In the IDEOne code, running it on the English string gives a match rate of roughly 69.3% against the common English words. Running it on the German string gives a match rate of only 4.57%.
This problem is called language detection and is not trivial to solve with a single function. I suggest you use LanguageDetector from github.
I would go with the fourth solution and also search for words that are not English. For example, if you find "the", there is a high probability of English. If you find "el" or "la", the probability is high for Spanish. I would also search for "der", "die" and "das"; if those turn up, the text is very likely German.
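A rough sketch of that idea follows. The marker lists and the function name guess_language are made up for illustration, and a real detector would need much larger lists.

```php
<?php
// Hypothetical marker lists; deliberately tiny, for illustration only.
$markers = [
    'english' => ['the', 'and', 'is', 'of', 'to'],
    'german'  => ['der', 'die', 'das', 'und', 'ist'],
    'spanish' => ['el', 'la', 'los', 'y', 'es'],
];

// Return the language whose marker words occur most often, or 'unknown'.
function guess_language(string $text, array $markers): string
{
    // Lowercase, then split on anything that is not a letter (Unicode-aware).
    $words = preg_split('/[^\p{L}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $counts = array_count_values($words);

    $best = 'unknown';
    $bestScore = 0;
    foreach ($markers as $language => $list) {
        $score = 0;
        foreach ($list as $word) {
            $score += $counts[$word] ?? 0;
        }
        if ($score > $bestScore) { // on a tie, the first language listed wins
            $best = $language;
            $bestScore = $score;
        }
    }
    return $best;
}

echo guess_language('The cat is on the roof of the house', $markers), "\n";
echo guess_language('Der Hund und die Katze ist das Problem', $markers), "\n";
```

On a 5000+ character text the counts are far larger, so comparing the scores against each other is more telling than looking for any single marker.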
Related
Search mySQL for special unicode characters with mysqli with PHP
I have a search autocomplete feature which breaks when someone types a French character, like É.
É is stored like '\u00c9' - a unicode codepoint - in the mySQL table:
'id', 'term', 'count', 'words', 'locale'
'5218', '\u00c9COLORADO', '4', '1', 'fr-ca'
'5590', '\u00c9MADEUP', '1', '1', 'fr-ca'
'5511', 'EXCITE', '1', '1', 'fr-ca'
In the PHP, É is '\xc3\x89'. I wrote the code below to convert it to unicode for the query so it would match. On my system, json_encode() outputted "\\u00c9" so I had to str_replace() some of those additional characters:
$andrew = json_encode($criteria);
$temp2 = str_replace('"', "", $temp1);
$temp3 = str_replace('\\\\', '\\', $temp2);
$data = self::all( array( 'locale' => $locale , 'term' => array('$like' => $temp3."%" ) ), array('count'=>0,'term'=>2),0,12 );
When I type É in the search and error_log() the SQL query, it is:
SELECT * FROM search_term WHERE `locale` = 'fr-ca' AND `term` LIKE '\\\\u00c9%' ORDER BY `count` DESC, `term` ASC,
When I run that SQL query in mySQL Workbench, it works (the quadruple backslashes are necessary in the case of LIKE) and the result set is:
'id', 'term', 'count', 'words', 'locale'
'5218', '\u00c9COLORADO', '4', '1', 'fr-ca'
'5590', '\u00c9MADEUP', '1', '1', 'fr-ca'
But when I run that query in PHP with mysqli:
$res = mysqli_query($conn, $query);
it doesn't return any results/matches.
How or why does mysqli_query() change the query so it fails? How do I write this so that when the search character is É it matches that character, as it is stored, in the database?
Add the JSON_UNESCAPED_UNICODE flag: json_encode($str, JSON_UNESCAPED_UNICODE). That way you get the letter itself, not the \uXXXX escape sequence.
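A small sketch of the difference the flag makes. The variable names are illustrative, and the sample term is written with a code point escape (PHP 7+) so the example does not depend on the file's encoding.

```php
<?php
$term = "\u{00C9}COLORADO"; // the UTF-8 string "ÉCOLORADO"

// Default behaviour: non-ASCII characters are escaped as \uXXXX.
$escaped = json_encode($term);

// With the flag, the character is kept as-is in the output.
$unescaped = json_encode($term, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n";   // "\u00c9COLORADO" (including the double quotes)
echo $unescaped, "\n"; // "ÉCOLORADO" (including the double quotes)
```

Note that this only helps if the table holds the real character. If the column contains the literal text \u00c9, as in the question, the stored data would presumably need converting as well.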
Generating 5 digit alphanumeric code with predefined characters in sequence
I would like to generate a membership number consisting of alphanumeric characters, removing i, o and l to avoid confusion when typing. It is to be done in PHP (also using Laravel 5.7 if that matters, but I feel this is a PHP question).
If simply using 0-9, the membership number would start at 00001 for the 1st one, and the 11th person would have 00011.
I would like to use alphanumeric characters from 0-9 + a-z (removing said letters): 0-9 (total 10 characters) and abcdefghjkmnpqrstuvwxyz (total 23 characters), giving a total of 33 characters in each count cycle (0-9 + a-z) instead of just 10 (0-9).
So the first membership number would still be 00001, whereas the 12th would now be 0000a, the 14th 0000c and the 34th 0001a.
To summarize, I need a way of defining the characters for counting so that the number can be generated based on the id of a user. I hope I have explained this well enough.
Assuming that these are the only characters you want to use: 0123456789abcdefghjkmnpqrstuvwxyz
You can use base_convert() and strtr() to translate specific characters of the result to the characters you want.
function mybase33($number) {
    return str_pad(strtr(base_convert($number, 10, 33), [
        'i' => 'j',
        'j' => 'k',
        'k' => 'm',
        'l' => 'n',
        'm' => 'p',
        'n' => 'q',
        'o' => 'r',
        'p' => 's',
        'q' => 't',
        'r' => 'u',
        's' => 'v',
        't' => 'w',
        'u' => 'x',
        'v' => 'y',
        'w' => 'z',
    ]), 5, '0', STR_PAD_LEFT);
}
echo "9 is ".mybase33(9)."\n";
echo "10 is ".mybase33(10)."\n";
echo "12 is ".mybase33(12)."\n";
echo "14 is ".mybase33(14)."\n";
echo "32 is ".mybase33(32)."\n";
echo "33 is ".mybase33(33)."\n";
echo "34 is ".mybase33(34)."\n";
Output:
9 is 00009
10 is 0000a
12 is 0000c
14 is 0000e
32 is 0000z
33 is 00010
34 is 00011
https://3v4l.org/8YtaR
Explanation
The output of base_convert() uses these characters: 0123456789abcdefghijklmnopqrstuvw
The strtr() translates specific characters of that output to: 0123456789abcdefghjkmnpqrstuvwxyz
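For comparison, here is a sketch of the same conversion done directly against a custom alphabet, without the translation table. The function name membership_code is made up for this example, and non-negative integer ids are assumed.

```php
<?php
// Hypothetical alternative: convert the id digit by digit using a custom alphabet.
// Assumes $id >= 0.
function membership_code(int $id, int $width = 5): string
{
    $alphabet = '0123456789abcdefghjkmnpqrstuvwxyz'; // 33 symbols: no i, l or o
    $base = strlen($alphabet);
    $code = '';
    do {
        $code = $alphabet[$id % $base] . $code; // prepend the least significant digit
        $id = intdiv($id, $base);
    } while ($id > 0);
    return str_pad($code, $width, '0', STR_PAD_LEFT);
}

foreach ([9, 10, 12, 32, 33, 34] as $id) {
    echo "$id is " . membership_code($id) . "\n";
}
```

Changing the set of excluded letters then only means editing the alphabet string, at the cost of a loop instead of a single base_convert() call.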
PHP combinations using all words each time [duplicate]
This is my first question at this site, so I hope that I will be specific enough with this.
I need to transform a text string into several arrays with all the different combinations of the 'words' and 'word phrases' in the text string.
So the string would be like: "Football match France 2013"
From this I want the following array:
array(
    0 => array( 'Football', 'match', 'France', '2013' ),
    1 => array( 'Football', 'match', 'France 2013' ),
    2 => array( 'Football', 'match France', '2013' ),
    3 => array( 'Football', 'match France 2013' ),
    4 => array( 'Football match', 'France', '2013' ),
    5 => array( 'Football match', 'France 2013', ),
    6 => array( 'Football match France', '2013' ),
    7 => array( 'Football match France 2013', ),
)
So the restriction is that each result string may consist of 1 to n consecutive words, and that in total each sub-array should contain each word exactly once.
Here is some code that works.
<?php
$str = 'Football match France 2013'; // Initialize sentence
$words = explode(" ",$str); // Break sentence into words
$p = array(array(array_shift($words))); // Load first word into permutation that has nothing to connect to
foreach($words as $word) { // for each remaining word
    $a = $p; // copy existing permutation for not-connected set
    $b = $p; // copy existing permutation for connected set
    $s = count($p); // cache number of items in permutation
    $p = array(); // reset permutation (attempt to force garbage collection before adding words)
    for($i=0;$i<$s;$i++) { // loop through each item
        $a[$i][] = $word; // add word (not-connected)
        $b[$i][count($b[$i])-1] .= " ".$word; // add word (connected)
    }
    $p = array_merge($a,$b); // create permutation result by joining connected and not-connected sets
}
// Dump the array
print_r($p);
?>
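Another way to look at the problem: with n words there are n-1 gaps, and each gap is either a split or a join, which gives 2^(n-1) groupings. The sketch below enumerates them with a bitmask; the function name word_groupings is made up for this example.

```php
<?php
// Hypothetical bitmask variant: each bit of $mask decides whether a gap
// between two neighbouring words is a split (1) or a join (0).
function word_groupings(string $sentence): array
{
    $words = preg_split('/\s+/', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);
    $n = count($words);
    if ($n === 0) {
        return [];
    }
    $results = [];
    // Count down so the "all split" grouping comes first.
    for ($mask = (1 << ($n - 1)) - 1; $mask >= 0; $mask--) {
        $groups = [$words[0]];
        for ($gap = 0; $gap < $n - 1; $gap++) {
            // The leftmost gap maps to the most significant bit.
            if ($mask & (1 << ($n - 2 - $gap))) {
                $groups[] = $words[$gap + 1];                            // split: start a new phrase
            } else {
                $groups[count($groups) - 1] .= ' ' . $words[$gap + 1];  // join: extend the last phrase
            }
        }
        $results[] = $groups;
    }
    return $results;
}

print_r(word_groupings('Football match France 2013'));
```

For the four-word example this yields the eight groupings in the same order as the array in the question.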
Find unique combinations of values from arrays filtering out any duplicate pairs
Using php, I am looking to find a set of unique combinations of a specified length while making sure that no two identical values are present in more than one combination.
For example, if I want to find all unique combinations of 3 values (fallback to combinations of 2 values if 3 are not possible) with this array:
$array = array(
    array('1', '2'),
    array('3', '4'),
    array('5', '6'),
);
One possible set of combinations is 123, 456, 14, 15, 16, 24, 25, 26, 34, 35, 36
Note that each number is always combined once and only once with a different number. No duplicate number pairs show up in any combination.
Just for clarity's sake, even though 123 and 135 would be unique combinations, only one of these would be returned since the pair 13 occurs in both. The main criteria is that all numbers are eventually grouped with each other number, but only once.
In the final product, the number of arrays and number of values will be notably larger as in:
$array = array(
    array('1', '2', '3', '4', '5', '6', '7', '8'),
    array('9', '10', '11', '12', '13', '14', '15', '16'),
    array('17', '18', '19', '20', '21', '22', '23', '24'),
    array('25', '26', '27', '28', '29', '30', '31')
);
Any help/code to accomplish this would be most appreciated.
UPDATE: I've taken the brute force approach. First off, I'm using the pear package Math_Combinatorics to create the combinations, starting with a specified maximum size grouping and working my way down to pairs. This way I can get all possible combinations when iterating through to strip out any duplicate clusters within the groups.
This code works but is extremely memory intensive. Generating all combinations for an array of 32 values in groups of 6 uses an excess of 1.5G of memory. Is there a better algorithm or approach that will let me use bigger arrays without running out of memory?
Here is the current state of the code:
require_once 'Combinatorics.php';
$combinatorics = new Math_Combinatorics;
$array = range(1,20,1);
$maxgroup = (6);
$combinations = $combinatorics->combinations($array, $maxgroup);
for($c=$maxgroup-1;$c>1;$c--) {
    $comb = $combinatorics->combinations($array, $c);
    $combinations = array_merge($combinations, $comb);
    $comb = null;
}
for($j=0;$j<sizeof($combinations);$j++) {
    for($i=sizeof($combinations)-1;$i>=$j+1;$i--) {
        $diff = array_intersect($combinations[$j], $combinations[$i]);
        if(count($diff)>1) {
            unset($combinations[$i]);
        }
    }
    $combinations = array_values($combinations);
}
print_r($combinations);
Since the structure is just obscuring the numbers which are available, you should first unfold the nested arrays. I'll be kind and do that for you:
$numbers = [];
foreach ($array as $subarr) {
    foreach ($subarr as $num) {
        $numbers[] = $num;
    }
}
I'm assuming there aren't any duplicate numbers in the input.
Next, you want to perform your algorithm for finding the unique combinations. With an array this small, even a recursive solution will work. You don't have to try all the combinatorially many combinations.
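One possible reading of that last point is a greedy pass that walks the candidate groups lazily and only remembers which pairs have already been used, instead of materializing every combination up front. The sketch below assumes PHP 7+ (generators with yield from); the function names lazy_combinations and unique_pair_groups are made up, and the greedy choice is not guaranteed to produce the largest possible number of big groups.

```php
<?php
// Lazily yield every $k-element combination of $items (a 0-indexed list),
// so the full list of combinations is never held in memory.
function lazy_combinations(array $items, int $k, int $start = 0, array $prefix = []): Generator
{
    if ($k === 0) {
        yield $prefix;
        return;
    }
    for ($i = $start; $i <= count($items) - $k; $i++) {
        yield from lazy_combinations($items, $k - 1, $i + 1, array_merge($prefix, [$items[$i]]));
    }
}

// Greedy sketch: accept a group only if none of its pairs was used before,
// starting with the largest group size and falling back down to pairs.
function unique_pair_groups(array $numbers, int $maxSize): array
{
    $numbers = array_values($numbers);
    $usedPairs = [];
    $groups = [];
    for ($size = $maxSize; $size >= 2; $size--) {
        foreach (lazy_combinations($numbers, $size) as $group) {
            $pairKeys = [];
            foreach (lazy_combinations($group, 2) as $pair) {
                $key = $pair[0] . '|' . $pair[1];
                if (isset($usedPairs[$key])) {
                    continue 2; // a pair repeats: skip this whole group
                }
                $pairKeys[] = $key;
            }
            foreach ($pairKeys as $key) {
                $usedPairs[$key] = true;
            }
            $groups[] = $group;
        }
    }
    return $groups;
}

foreach (unique_pair_groups(range(1, 6), 3) as $group) {
    echo implode('', $group), "\n";
}
```

Memory use is then dominated by the set of used pairs and the accepted groups. The number of candidates visited is still combinatorial, so very large inputs would remain slow.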
Django, Python: Is there a simple way to convert PHP-style bracketed POST keys to multidimensional dict?
Specifically, I got a form that calls a Django service (written using Piston, but I don't think that's relevant), sending via POST something like this:
edu_type[3][name] => a
edu_type[3][spec] => b
edu_type[3][start_year] => c
edu_type[3][end_year] => d
edu_type[4][0][name] => Cisco
edu_type[4][0][spec] => CCNA
edu_type[4][0][start_year] => 2002
edu_type[4][0][end_year] => 2003
edu_type[4][1][name] => fiju
edu_type[4][1][spec] => briju
edu_type[4][1][start_year] => 1234
edu_type[4][1][end_year] => 5678
I would like to process this on the Python end to get something like this:
edu_type = {
    '3' : { 'name' : 'a', 'spec' : 'b', 'start_year' : 'c', 'end_year' : 'd' },
    '4' : {
        '0' : { 'name' : 'Cisco', 'spec' : 'CCNA', 'start_year' : '2002', 'end_year' : '2003' },
        '1' : { 'name' : 'fiju', 'spec' : 'briju', 'start_year' : '1234', 'end_year' : '5678' },
    },
}
Any ideas? Thanks!
I made a little parser in python to handle multidimensional dicts, you can find it at https://github.com/bernii/querystring-parser
Dottedish does something like what you want. http://pypi.python.org/pypi/dottedish. It doesn't really have a homepage but you can install it from pypi or download the source from github.
>>> import dottedish
>>> dottedish.unflatten([('3.name', 'a'), ('3.spec', 'b')])
{'3': {'name': 'a', 'spec': 'b'}}
I am riding off of the previous response by Atli about using PHP's json_encode...
A Python dict in its most basic form is syntactically very close to JSON. You could perform an eval() on such a structure to create a Python dict:
>>> blob = """{
...     '3' : { 'name' : 'a', 'spec' : 'b', 'start_year' : 'c', 'end_year' : 'd' },
...     '4' : {
...         '0' : { 'name' : 'Cisco', 'spec' : 'CCNA', 'start_year' : '2002', 'end_year' : '2003' },
...         '1' : { 'name' : 'fiju', 'spec' : 'briju', 'start_year' : '1234', 'end_year' : '5678' },
...     },
... }"""
>>> edu_type = eval(blob)
>>> edu_type
{'3': {'end_year': 'd', 'start_year': 'c', 'name': 'a', 'spec': 'b'}, '4': {'1': {'end_year': '5678', 'start_year': '1234', 'name': 'fiju', 'spec': 'briju'}, '0': {'end_year': '2003', 'start_year': '2002', 'name': 'Cisco', 'spec': 'CCNA'}}}
Now, is this the best way? Probably not, and eval() should never be run on input you do not trust. But it works and doesn't resort to regular expressions, which might technically be better but are definitely not a quicker option considering the time spent debugging and troubleshooting your pattern matching.
JSON is a good format to use for interstitial data transfer. Python also has a json module as part of the Standard Library. While that is pickier about the input you're parsing (for example, it requires double-quoted strings), it is almost certainly the better way to go about it (albeit with more work involved).
okay, so this is really hacky, but here goes:
let's say your input is a list of tuples. say: pairs = [('edu_type[3][end_year]', 'd'), ...]
from collections import defaultdict
def defdict():
    return defaultdict(defdict)
edu_type = defdict()
# Turn edu_type[3][end_year] into edu_type["3"]["end_year"]
statements = [(x.replace('[', '["').replace(']', '"]'), y) for x, y in pairs]
for statement in statements:
    exec('%s = "%s"' % statement)
Note that you should only use this if you trust the source of your input, as it is nowhere near safe.