Translating strings with multiple sections needing pluralization

Translating strings with multiple sections needing pluralization - php

We're using the Symfony Translation component in our PHP application. It is capable of handling pluralisation in a very clever way, but as far as I can tell it can only handle a single "quantity" per string.
For example, it can translate
I have 3 apples.
Or
I have 1 orange.
But I can't work out a way to handle more complex strings like:
I have 3 apples and 1 orange.
Now, the obvious solution is to translate them separately and then join them together, but in my real life situation the strings are more complicated than this and according to our German team the order of the components cannot always be guaranteed to be the same. Sticking with my fake apples and oranges example, we could have the English string:
I'll have 3 apples each morning and 1 orange each weekday afternoon for the next 2 weeks.
I'd like to have a translation string like:
I'll have {{1 apple|%count_apples% apples}} each morning and {{1 orange|%count_oranges% oranges}} each weekday afternoon for the next {{1 week|%count_weeks% weeks}}.
And we need to consider that in another language, the structure of the sentence might only work if we use:
For the next 2 weeks, I'll have 3 apples each morning and 1 orange each weekday afternoon.
For the next {{1 week|%count_weeks% weeks}}, I'll have {{1 apple|%count_apples% apples}} each morning and {{1 orange|%count_oranges% oranges}} each weekday afternoon.
To complicate things, further, the word for "and" might change depending on if one of the quantities is a plural. Right now, we're only bothered about English and German but will need other languages in the mid-term future and then there isn't even just a singular and plural.
We're open to using something other than the Symfony Translation component for this section if required as it is quite self-contained.
Does anybody have any past experience in this, or ideas as to how to go about implementing this?

Related

Parsing, formatting and generating data based on input

For some known inputs I have some known outputs/results. Based on this I want the program to generate result based on the input as per pre-filled input-results data.
Example input:
Enjoy your tea in the morning then have some bread in the lunch. Enjoy the taste of a garlic chicken in the dinner.
Your day starts with cold coffee. In the noon have some rice and fish curry.
Example output:
Have tea in the morning. Have some bread in the lunch. Have garlic chicken in the dinner.
Have cold coffee. Have some rice and fish curry.
I don't want to use string replace or regexp as it will break often. How or where do I start ?

If you have a large number of input and output pairs, you can treat this as a sequence to sequence task. The input can be considered your source and output can be considered as a target. You can easily develop a baseline model using OpenNMT.

Not really clear on your how to approach your specific problem, but let me go about a few ways to solve text related issues, since it seems to be what you are interested at.
Level 0 Static text hashing
IF, and that's a big if, your input is static, you could have digests maping inputs to outputs. But, as you mentioned, this is easily breakable. Even one extra space would result in a mismatch and that's why it's level 0.
Level 1 Pre-process your input:
Remove all extra spaces before, after and in-between words.
Remove stopwords from your input:
List of common stop-words https://www.textfixer.com/tutorials/common-english-words.txt
This step would transform your input to:
Enjoy tea morning bread lunch. Enjoy taste garlic chicken dinner.
day starts cold coffee. noon rice fish curry.
Next you could remove verbal conjugation, which doesn't apply to your example, but let's assume you had a sentences like:
drink tea, drank juice and drinks soda.
This sentence your become:
drink tea, drink juice drink soda
You could go even deeper and have synonyms normalization, example:
drink tea, sip water, slurped a juice, swallow beer
Then, all of them would become:
drink tea, drink water, drink juice, drink beer
After these steps are done, you have kind of a non statistical way of processing text. It all comes down to removing any redundancy and language flourish and getting down to the literal stuff.
And, of course, this approach loses a ton of the value contained in the english language. You can't tell sarcasm, you can't have analogies. So, this works for some domains, but it's not that advanced.
This approach is more about text processing and not language processing. See the difference?
If you need a smarter way to go about this, you should look into full text search algorithms
Level 2 Full text search algorithms
There are several ways to do this, here is one.
You've got a sentence like:
I want pizza
This search term would become
want piz za
And would search for
want piz
piz za
want za
This is super basic stuff, and it's just to show you how raw text processing works and ways you could go about this. Maybe you could have your inputs processed by level 1 to make them simpler and less variable and then have them processed by level 2 to be indexed in a db and then you have a nice way to query them
Level 3 NLP - Natural Language Processing
This is still not machine learning, but it is smarter and it's built on top of all the other steps. basically you would clean your inputs of nonsense and try to apply english gramatical structure to it.
To know more: https://dev.to/nicfoxds/getting-started-in-nlp-b0e
level 4 Deep learning stuff
Basically, google.
You get a bunch of text, a bunch of search queries, a bunch of user tracking data mapping queries to text. You feed all of that into a neural network and statistical models will detect patterns for you and make your search better as it goes.
Summary
If this is a project are serious about, look into NLU. It will give you a decent outcome as you track usage. Then, when you have enough user data, go for the deep learning stuff.
There's no easy way around this, you either do this by hand or implement a database that has some of those features, like elasticsearch. But as one of the comments mentioned, php is not a language for this.

If your input is truly known, then you can use str_replace() e.g.
$input = 'Enjoy your tea in the morning then have some bread in the lunch. Enjoy the taste of a garlic chicken in the dinner.
Your day starts with cold coffee. In the noon have some rice and fish curry.';
$old = array('Enjoy your ', ' then have ', '. Enjoy the taste of a ', 'Your day starts with ', '. In the noon have ');
$new = array('Have ' , '. Have ' , '. Enjoy ' , 'Have ' , '. Have ' );
$output = str_replace($old, $new, $input);
Beware of case sensitivity and things like spaces, periods and other punctuation.
If your input is less known, then you could use regex as you surmised.

Levenshtein - grouping hotel names

I have to group some hotel into the same category based on their names. I'm using levenshtein for grouping, but how much I've tried, some hotel are leaved outside the category they supposed to be, or in another category.
For example: all these hotel should be in the same category:
=============================
Best Western Bercy Rive Gauche
Best Western Colisee
Best Western Ducs De Bourgogne
Best Western Folkestone Opera
Best Western France Europe
Best Western Hotel Sydney Opera
Best Western Paris Louvre Opera
Best Western Hotel De Neuville
=============================
I'm having a list with all hotel names( like 1000 rows ). I also have how they should be grouped.
Any idea how to optimize levenshtein, making it more flexible for my situation?
$inserted = false;
foreach($hotelList as $key => $value){
if (levenshtein($key, $hotelName, 2, 5, 1) <= abs(strlen($key) - strlen($hotelName))){
array_push($hotelList[$key], trim($line));
$inserted = true;
}
}
// if no match was found add another entry
if (!$inserted){
$hotelList[$hotelName] = array(
trim($line)
);
}

I'll wade in with my thoughts. Firstly, grouping or "clustering" data like this is a pretty big topic, I won't really go into it particularly but perhaps point things in an ideal direction.
You did a brilliant thing by normalizing Levenshtein on the length of the strings compared- that's exactly right because you avoid the problem that the length of the string would overdetermine the similarity in many cases.
But the algorithm didn't solve the problem. For a start, we want to compare words. "Bent Eastern French Hotels" is obviously very different to "Best Western French Hotels", yet it would score better than "Best Western Paris Bed and Breakfasts", say. The intution to grasp here is that your tokens shouldn't be characters but words.
I like #saury's answer, but I'm not sure about the assumption at the beginning. Instead, let's start with something nice and easy often called "bag of words". We then implement a hashing trick, which would allow you to idetify the key phrases based on the intuition that the least used words contain the most information.
If you subscribe to the idea that hotel brand names are near the beginning you could always skew on their proximity to the start of the string too. Thing is, your groups will as likely end up being "France" as "Best" / "Western" (but not "hotel"- why?).
You want your results to be more accurate?
From here on in, we're gonna have to take a step up to some serious algorithms- enjoy surfing the many stack overflow topics. My instinct is that I bet many hotel names aren't branded at all, so you'll need different categories for them too. And my instinct is also that the number of repeated words in hotel names is going to be relatively slim- some words will be frequent members of hotel names. These facts would be problems for the above. In this case, there's a really popular (if cliched for SO) technique called k-means, a fun introduction to which would be to extend an algorithm like this (very bravely written in php) to take your chosen n keyphrases as the n dimensions of the cluster, then take the majority components of the cluster center-points as your categorization tags. (That would eliminate "France", say, because hits for "France" would be spread across the n-dimensional space pretty evenly).
This is probably all a bit much to take on for something that would seem like a small problem- but I want to emphasize that if your data isn't structured, there really aren't any short-cuts to doing things properly.

what levenshtein distance value do you take as the delta between words to be treated as part of same group ? Seems that you tend to group hotels based on the initial few words and that will require a different approach altogether (like do dictionary sort , compare current string with next strings etc). However if your use-case still requires to calculate levenshtein distance then I would suggest you to sort the Strings based on their length and then start comparing each string with other strings of similar length (apply you own heuristic to what you consider as 'similar' like you may say isSimilar = Math.abs(str1.length - str2.length) < SOME_LOWEST_DELTA_VALUE or something like that)

You might want to read about http://en.wikipedia.org/wiki/K-means_clustering and http://en.wikipedia.org/wiki/Cluster_analysis in general.

Implementing Bayes classifier (in PHP)

I have a theoretical question about a Naive Bayes Classifier. Assume I have trained the classifier with the following training data:
class word count
-----------------
pos good 1
sun 1
neu tree 1
neg bad 1
sad 1
Assume I now classify "good sun great". There are now two options:
1) classify against the trainingdata, which remains static. Meaning both "good" and "sun" come from the positive category, classifying this string as a positive. After classification, the training table remains unchanged. All strings are thus classified against the static set of training data.
2) You classify the string, but then update the training data, as in the table underneath. Thus, the next string will be classified against a more "advanced" set of training data than this one. By the end of (automatic) classification, the table that started out as a simple training set, will have grown in size, having been expanded with many words (and updated word counts)
class word count
-----------------
pos good 2
sun 2
great 1
neu tree 1
neg bad 1
sad 1
In my implementation of NMB I used the first method, but I'm now second-guessing I should have done the latter. Please enlighten me :-)

The method you've implemented is indeed the popular and accepted way of building classifiers (and not just Bayesian ones).
Using "unlabeled" data, i.e. data you have no ground-truth about, to update the classifier, is a more advanced and complicated technique, sometimes called "semi-supervised learning".
Using this class of algorithms might or might not be a good fit to your specific task - it's usually a matter of trial and error.
If you do decide to incorporate unlabeled data into your model, you should probably try out one of the popular algorithms of doing that, e.g. EM.

similar_search for physics units

I implemented the similar search very good. But there is one problem with units. Because units are prettey short the similar search is not that good.
I do create a recipe with:
1 kg Tomato
If the user is writing:
1 gk Tomato
the similar search is not that good. Is there a pretty fine way to do it? Right now I just use an array and compare the units. My array looks like this:
array(kg, gk, kilgramm)
If there is a match then take this unit. Is there a better way to do it?
Thanks!

As long as you're looking at just a small number of terms, preferably short ones, you can use the levenshtein algorithm to find the cost of transforming one string into another. It's less expensive than similar_text, so if that works, levenshtein will probably work fine as well.

How would you make this simple scheduler?

I'm working on an admin section right now to schedule employees for their shifts. It is simple in that on any day, employees work either the day or the night shift. How would you go about doing this so that it is easy to display? I was thinking I have a table with the employee names going down the first column. Then, the next 7 columns are the upcoming 7 days. Each cell would be a drop down with Not Scheduled, Day Shift and Night Shift as the three choices. Is this how you'd do it? I've never done anything like this so I could really use some advice.
Thanks!

For something like employees you're probably going to see a lot of text no matter what. Why not just list in times and out times, sorted by time?
John Smith in 7:30
Mary Anne in 7:30
John Smith out 4:00
Mary Anne out 4:00
You could color-code in vs out. You could also break it down to 5 10 15 or 30 minute increments for sections, though i'm not sure what the value would be. Maybe faster visual reference?
Do something like this for a "day" view which can be drilled in from week/month, etc. While you're at it you could easily create an employee schedule view that could be a little bit more graphical.
Edit:
I suppose the above doesn't address your question so much as it exemplifies what you can do with the data once you have it.
For scheduling you could create a pretty basic form comprised of a sort of lookup/autocomplete field for the employee name, and a date field and 2 time fields (in/out). If you use something like jQuery dialog, you could insert these schedules directly from any other schedule view to help you see what you're looking for. You could pretty easily fit 7 columns wide of lists of in and out times to represent a week in sort of a "spread" view. You could have options to page by week or by day. In paging by day you would have sort of a rotation capability, which might be handy if the end of one week can impact the beginning of the next.

I think, as the admin, I could prefer something just a bit simpler. A drop-down with Day or Night; and the ability to choose neither, if no shift.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.