Parsing, formatting and generating data based on input

For some known inputs I have known outputs. Based on these pre-filled input/output pairs, I want the program to generate the result for a given input.
Example input:
Enjoy your tea in the morning then have some bread in the lunch. Enjoy the taste of a garlic chicken in the dinner.
Your day starts with cold coffee. In the noon have some rice and fish curry.
Example output:
Have tea in the morning. Have some bread in the lunch. Have garlic chicken in the dinner.
Have cold coffee. Have some rice and fish curry.
I don't want to use string replacement or regular expressions, as they will break often. How or where do I start?

If you have a large number of input and output pairs, you can treat this as a sequence to sequence task. The input can be considered your source and output can be considered as a target. You can easily develop a baseline model using OpenNMT.

It's not really clear to me how to approach your specific problem, but let me go through a few ways to solve text-related issues, since that seems to be what you are interested in.
Level 0 Static text hashing
If, and that's a big if, your input is static, you could have digests mapping inputs to outputs. But, as you mentioned, this is easily breakable. Even one extra space would result in a mismatch, and that's why it's level 0.
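A minimal sketch of the digest idea, assuming md5() as the digest and a hypothetical $known table of input/output pairs (the function name lookupOutput is made up for illustration). It also shows how a single extra space breaks the lookup:

```php
<?php
// Hypothetical table of known inputs and their known outputs.
$known = [
    'Your day starts with cold coffee. In the noon have some rice and fish curry.'
        => 'Have cold coffee. Have some rice and fish curry.',
];

// Index the outputs by the digest of their input.
$table = [];
foreach ($known as $input => $output) {
    $table[md5($input)] = $output;
}

// Returns the stored output, or null when the exact input was never seen.
function lookupOutput(array $table, string $input): ?string
{
    return $table[md5($input)] ?? null;
}

$exact = lookupOutput($table, 'Your day starts with cold coffee. In the noon have some rice and fish curry.');
// Same sentence with one extra space after "with": the digest no longer matches.
$extraSpace = lookupOutput($table, 'Your day starts with  cold coffee. In the noon have some rice and fish curry.');
```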
Level 1 Pre-process your input:
Remove all extra spaces before, after and in-between words.
Remove stopwords from your input:
List of common stop-words https://www.textfixer.com/tutorials/common-english-words.txt
This step would transform your input to:
Enjoy tea morning bread lunch. Enjoy taste garlic chicken dinner.
day starts cold coffee. noon rice fish curry.
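The two steps so far could be sketched roughly like this. The stop-word list is a tiny illustrative subset (not the linked list), and the function name preprocess is an assumption of this sketch:

```php
<?php
// A tiny, illustrative stop-word list; a real one would be much longer.
$stopWords = ['your', 'in', 'the', 'then', 'have', 'some', 'of', 'a', 'with', 'and'];

function preprocess(string $text, array $stopWords): string
{
    // Collapse runs of whitespace and trim both ends.
    $text = trim(preg_replace('/\s+/', ' ', $text));

    $kept = [];
    foreach (explode(' ', $text) as $word) {
        // Compare case-insensitively and ignore trailing punctuation.
        $bare = strtolower(rtrim($word, '.,;:!?'));
        if (!in_array($bare, $stopWords, true)) {
            $kept[] = $word;
        }
    }
    return implode(' ', $kept);
}

$result = preprocess('  Enjoy your tea   in the morning then have some bread in the lunch. ', $stopWords);
// $result is "Enjoy tea morning bread lunch."
```

Note that a stop word directly followed by punctuation is dropped together with that punctuation, which may or may not be what you want.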
Next you could remove verbal conjugation, which doesn't apply to your example, but let's assume you had a sentence like:
drink tea, drank juice and drinks soda.
This sentence would become:
drink tea, drink juice drink soda
You could go even deeper and normalize synonyms, for example:
drink tea, sip water, slurped a juice, swallow beer
Then, all of them would become:
drink tea, drink water, drink juice, drink beer
After these steps are done, you have a kind of non-statistical way of processing text. It all comes down to removing any redundancy and language flourish and getting down to the literal stuff.
And, of course, this approach loses a ton of the value contained in the English language. You can't tell sarcasm, you can't have analogies. So, this works for some domains, but it's not that advanced.
This approach is more about text processing and not language processing. See the difference?
If you need a smarter way to go about this, you should look into full text search algorithms.
Level 2 Full text search algorithms
There are several ways to do this, here is one.
You've got a sentence like:
I want pizza
This search term would become
want piz za
And would search for
want piz
piz za
want za
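As a rough sketch of that last step only: given the tokens (how "pizza" gets split into "piz" and "za" is taken as given here), every pair of tokens becomes a search phrase. The function name tokenPairs is hypothetical:

```php
<?php
// Returns every unordered pair of tokens as a "left right" search phrase.
function tokenPairs(array $tokens): array
{
    $tokens = array_values($tokens); // make sure indexes are 0..n-1
    $pairs = [];
    $n = count($tokens);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            $pairs[] = $tokens[$i] . ' ' . $tokens[$j];
        }
    }
    return $pairs;
}

$pairs = tokenPairs(['want', 'piz', 'za']);
// $pairs is ["want piz", "want za", "piz za"]
```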
This is super basic stuff, and it's just to show you how raw text processing works and ways you could go about this. Maybe you could have your inputs processed by level 1 to make them simpler and less variable, then have them processed by level 2 to be indexed in a database, and then you have a nice way to query them.
Level 3 NLP - Natural Language Processing
This is still not machine learning, but it is smarter and it's built on top of all the other steps. Basically, you would clean your inputs of nonsense and try to apply English grammatical structure to them.
To know more: https://dev.to/nicfoxds/getting-started-in-nlp-b0e
Level 4 Deep learning stuff
Basically, Google.
You get a bunch of text, a bunch of search queries, a bunch of user tracking data mapping queries to text. You feed all of that into a neural network and statistical models will detect patterns for you and make your search better as it goes.
Summary
If this is a project you are serious about, look into NLU. It will give you a decent outcome as you track usage. Then, when you have enough user data, go for the deep learning stuff.
There's no easy way around this: you either do it by hand or use a database that has some of those features, like Elasticsearch. But as one of the comments mentioned, PHP is not the language for this.

If your input is truly known, then you can use str_replace() e.g.
$input = 'Enjoy your tea in the morning then have some bread in the lunch. Enjoy the taste of a garlic chicken in the dinner.
Your day starts with cold coffee. In the noon have some rice and fish curry.';
$old = array('Enjoy your ', ' then have ', '. Enjoy the taste of a ', 'Your day starts with ', '. In the noon have ');
// Each entry replaces the $old entry at the same index
$new = array('Have ', '. Have ', '. Have ', 'Have ', '. Have ');
$output = str_replace($old, $new, $input);
Beware of case sensitivity and things like spaces, periods and other punctuation.
If your input is less known, then you could use regex as you surmised.

Related

Translating strings with multiple sections needing pluralization

We're using the Symfony Translation component in our PHP application. It is capable of handling pluralisation in a very clever way, but as far as I can tell it can only handle a single "quantity" per string.
For example, it can translate
I have 3 apples.
Or
I have 1 orange.
But I can't work out a way to handle more complex strings like:
I have 3 apples and 1 orange.
Now, the obvious solution is to translate them separately and then join them together, but in my real life situation the strings are more complicated than this and according to our German team the order of the components cannot always be guaranteed to be the same. Sticking with my fake apples and oranges example, we could have the English string:
I'll have 3 apples each morning and 1 orange each weekday afternoon for the next 2 weeks.
I'd like to have a translation string like:
I'll have {{1 apple|%count_apples% apples}} each morning and {{1 orange|%count_oranges% oranges}} each weekday afternoon for the next {{1 week|%count_weeks% weeks}}.
And we need to consider that in another language, the structure of the sentence might only work if we use:
For the next 2 weeks, I'll have 3 apples each morning and 1 orange each weekday afternoon.
For the next {{1 week|%count_weeks% weeks}}, I'll have {{1 apple|%count_apples% apples}} each morning and {{1 orange|%count_oranges% oranges}} each weekday afternoon.
To complicate things further, the word for "and" might change depending on whether one of the quantities is a plural. Right now, we're only bothered about English and German, but we will need other languages in the mid-term future, and then there isn't even just a singular and a plural.
We're open to using something other than the Symfony Translation component for this section if required as it is quite self-contained.
Does anybody have any past experience in this, or ideas as to how to go about implementing this?
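One possible starting point, sketched without the Symfony component: resolve each {{singular|plural}} section independently, using the %count_...% placeholder inside the plural variant to find its quantity. The function name renderPlurals and the shape of the $counts array are assumptions of this sketch, and it only handles languages with exactly two forms, so it does not address languages with more plural categories or a changing "and":

```php
<?php
// Resolves every {{singular|plural}} section of a template.
// $counts maps placeholder names (e.g. "count_apples") to integers.
function renderPlurals(string $template, array $counts): string
{
    return preg_replace_callback(
        '/\{\{([^|{}]*)\|([^{}]*)\}\}/',
        function (array $m) use ($counts) {
            // The plural variant names its counter, e.g. %count_apples%.
            if (!preg_match('/%(\w+)%/', $m[2], $name) || !isset($counts[$name[1]])) {
                return $m[0]; // leave unknown sections untouched
            }
            $count = $counts[$name[1]];
            return $count === 1
                ? $m[1]
                : str_replace($name[0], (string) $count, $m[2]);
        },
        $template
    );
}

$sentence = renderPlurals(
    'I have {{1 apple|%count_apples% apples}} and {{1 orange|%count_oranges% oranges}}.',
    ['count_apples' => 3, 'count_oranges' => 1]
);
// $sentence is "I have 3 apples and 1 orange."
```

Because each section carries its own counter name, the sections can appear in any order in a translated string.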

Convert text in specific format into real PHP code assignments

I'm having trouble turning text in a specific format into real, working PHP code.
My text file:
#T1:The German sociologist Max Weber once proposed
#S:Jos Bleau
#C:jos.bleau#domain.com
#L:"He used to be so conservative," she says, throwing up her hands in mock exasperation. "We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic
#R:At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn't borrow source code from past programs, and yet, with a single stroke of the president's pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab's scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard's father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son's aversion to authority. She can also attest to her son's lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.
#ST:Fusions
#R:Such mythological descriptions, while extreme, underline an important fact. The ninth floor of 545 Tech Square was more than a workplace for many. For hackers such as Stallman, it was home.
The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn't long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.
#ST:Fusions
#R:The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn't long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.
#BV:Thirty years later, Breidbart remembers
#CP:(Picture: Credit – Jos Bleau) or #CP:(Picture: Thanks)
The expected output I need (Half pseudo code; Unescaped quotes):
<?php
$title1 = 'The German sociologist Max Weber once proposed';
$signature = 'Jos Bleau';
$email = 'jos.bleau#domain.com';
$lead = '"He used to be so conservative," she says, throwing up her hands in mock exasperation. "We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic';
$text[] = 'At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn't borrow source code from past programs, and yet, with a single stroke of the president's pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab's scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard's father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son's aversion to authority. She can also attest to her son's lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.';
$subtitle[] = 'Fusions';
//etc...
?>
Note:
The names like $title1 and #T1 are completely unrelated to each other, and $title1 is just used as an example. It could also be $xy or something else.
If #XY appears more than once in the file, then the values should be added as array elements, else as a simple assignment.
I don't know if preg_split() is the right direction. Can I do it with that, or do I have to use other functions to accomplish this?

Explanation
First we get the data from the text file into a variable with file_get_contents() and also initialize our $output array, where each element is a line in the output, with a php tag <?php.
You can also modify $lookup with shortcut => variable name elements, where you can define which #XY: gets replaced with which variable name. If not defined the shortcut will be used as variable name.
Now that we have prepared some stuff we match each #XY: with the corresponding data with preg_match_all().
Regular Expression
/#(\w+):(.*?)(?=#\w+:)/s
\w+ matches one or more word characters [a-zA-Z0-9_], which is the XY part of #XY:, and we keep it with a capturing group
+ is a quantifier and says that \w should match 1 or more times
(.*?) matches as few characters as needed (the ? makes the quantifier lazy)
With the flag s, . also matches new lines
(?=#\w+:) makes sure (.*?) matches everything up to the next #XY: and not more. ?= is a positive lookahead: it checks whether the regex in the parentheses (#\w+:) can be matched at that point, without consuming it
We also preemptively save the amount each shortcut appears in the data with array_count_values().
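To see what the pattern and array_count_values() produce, here is a small sketch on made-up input. The sample text and the trailing #END: marker are assumptions of this sketch; the marker is there because the lookahead only matches a block that is followed by another #XY: marker:

```php
<?php
// Made-up sample; #END: is a sentinel so the last real block has a marker after it.
$sample = "#T1:A title\n#R:First paragraph\n#R:Second paragraph\n#END:";

preg_match_all('/#(\w+):(.*?)(?=#\w+:)/s', $sample, $m);

$shortcuts = $m[1];                           // ["T1", "R", "R"]
$values    = array_map('trim', $m[2]);        // the text after each shortcut
$counts    = array_count_values($shortcuts);  // ["T1" => 1, "R" => 2]
```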
Now that we have matched all data which we want we can loop through all shortcuts, which are saved in $m[1]. In the foreach loop we simply check if you have defined a lookup variable name or if we use the shortcut as variable name.
Then we simply add each assignment as new element to the output array. Where you have to note three things:
Complex (curly) syntax is used, so that you don't get problems with invalid variable names, see: How can I access a property with an invalid name?
Depending on how many times a shortcut appeared in the data, we decide if it should be added as an array element or a normal assignment. If the shortcut appears more than once in the data, the value is added as an array element, else as a simple string assignment
We use trim() to remove spaces, new lines, ... from the start and end of the string. And we use addslashes(), so we don't get problems with quotes
And now we are done. Depending on how you want to output the result, you can save it to a file with file_put_contents() or just print out the array.
Code
<?php
// Read the raw text and start the generated file with an opening PHP tag
$text = file_get_contents("test.txt");
$output = ["<?php"];
// Optional map of shortcut => variable name
$lookup = []; //Example: ["ST" => "subtitle"]

// $m[1] holds the shortcuts, $m[2] the text that follows each one
preg_match_all("/#(\w+):(.*?)(?=#\w+:)/s", $text, $m);
$variableShortcutCount = array_count_values($m[1]);

foreach($m[1] as $key => $variableShortcut){
    // Use the lookup name if one is defined, else the shortcut itself
    $name = $lookup[$variableShortcut] ?? $variableShortcut;
    // Shortcuts that occur more than once are added as array elements
    $suffix = $variableShortcutCount[$variableShortcut] > 1 ? '"}[]' : '"}';
    $output[] = '${"' . $name . $suffix . " = '" . addslashes(trim($m[2][$key])) . "';";
}

//Output to file
//file_put_contents("output.txt", implode(PHP_EOL, $output));
//Output to browser
echo "<pre><code>";
highlight_string(implode(PHP_EOL, $output));
?>
output:
<?php
${"T1"} = 'The German sociologist Max Weber once proposed';
${"S"} = 'Jos Bleau';
${"C"} = 'jos.bleau#domain.com';
${"L"} = '\"He used to be so conservative,\" she says, throwing up her hands in mock exasperation. \"We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic';
${"R"}[] = 'At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn\'t borrow source code from past programs, and yet, with a single stroke of the president\'s pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab\'s scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard\'s father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son\'s aversion to authority. She can also attest to her son\'s lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.';
${"subtitle"}[] = 'Fusions';
${"R"}[] = 'Such mythological descriptions, while extreme, underline an important fact. The ninth floor of 545 Tech Square was more than a workplace for many. For hackers such as Stallman, it was home.
The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn\'t long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.';
${"subtitle"}[] = 'Fusions';
${"R"}[] = 'The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn\'t long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.';
${"BV"} = 'Thirty years later, Breidbart remembers';
${"CP"} = '(Picture: Credit – Jos Bleau) or';
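Note that in the output above the second #CP: entry from the input is missing: the lookahead needs another #XY: marker after each block, so the last block in the file is never matched. A possible variant, offered as a sketch and not as part of the original answer, also accepts the end of the input (\z) as a terminator:

```php
<?php
// Made-up sample whose last block has no marker after it.
$sample = "#T1:A title\n#CP:Last block";

// Original pattern: the final block is skipped.
preg_match_all('/#(\w+):(.*?)(?=#\w+:)/s', $sample, $strict);

// Variant: the end of the input also terminates a block.
preg_match_all('/#(\w+):(.*?)(?=#\w+:|\z)/s', $sample, $relaxed);

$strictKeys  = $strict[1];            // ["T1"]
$relaxedKeys = $relaxed[1];           // ["T1", "CP"]
$lastValue   = trim($relaxed[2][1]);  // "Last block"
```

With this variant a shortcut that only repeats at the very end of the file would be counted twice, so it would be emitted as an array element.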

Levenshtein - grouping hotel names

I have to group some hotels into the same category based on their names. I'm using levenshtein for grouping, but however much I've tried, some hotels are left outside the category they are supposed to be in, or end up in another category.
For example, all these hotels should be in the same category:
=============================
Best Western Bercy Rive Gauche
Best Western Colisee
Best Western Ducs De Bourgogne
Best Western Folkestone Opera
Best Western France Europe
Best Western Hotel Sydney Opera
Best Western Paris Louvre Opera
Best Western Hotel De Neuville
=============================
I have a list with all hotel names (about 1000 rows). I also have how they should be grouped.
Any idea how to optimize levenshtein, making it more flexible for my situation?
$inserted = false;
foreach($hotelList as $key => $value){
    if (levenshtein($key, $hotelName, 2, 5, 1) <= abs(strlen($key) - strlen($hotelName))){
        array_push($hotelList[$key], trim($line));
        $inserted = true;
    }
}
// if no match was found add another entry
if (!$inserted){
    $hotelList[$hotelName] = array(
        trim($line)
    );
}

I'll wade in with my thoughts. Firstly, grouping or "clustering" data like this is a pretty big topic. I won't go into it in detail, but I can perhaps point things in a useful direction.
You did a brilliant thing by normalizing Levenshtein on the length of the strings compared. That's exactly right, because you avoid the problem that the length of the string would overdetermine the similarity in many cases.
But the algorithm didn't solve the problem. For a start, we want to compare words. "Bent Eastern French Hotels" is obviously very different from "Best Western French Hotels", yet it would score better than "Best Western Paris Bed and Breakfasts", say. The intuition to grasp here is that your tokens shouldn't be characters but words.
I like @saury's answer, but I'm not sure about the assumption at the beginning. Instead, let's start with something nice and easy, often called "bag of words". We then implement a hashing trick, which would allow you to identify the key phrases based on the intuition that the least used words contain the most information.
If you subscribe to the idea that hotel brand names are near the beginning, you could always weight words by their proximity to the start of the string too. The thing is, your groups are as likely to end up being "France" as "Best" / "Western" (but not "hotel": why?).
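As a first step toward the bag-of-words idea, one could count in how many names each word occurs. This is only a sketch of the counting step, using the names from the question; the function name wordDocumentFrequency is hypothetical, and how the counts are turned into groups (weights, thresholds) is left open:

```php
<?php
// Counts, for each lower-cased word, the number of names it appears in.
function wordDocumentFrequency(array $names): array
{
    $frequency = [];
    foreach ($names as $name) {
        $words = preg_split('/\s+/', strtolower(trim($name)), -1, PREG_SPLIT_NO_EMPTY);
        // array_unique: a word repeated inside one name counts once.
        foreach (array_unique($words) as $word) {
            $frequency[$word] = ($frequency[$word] ?? 0) + 1;
        }
    }
    arsort($frequency); // most frequent words first
    return $frequency;
}

$frequency = wordDocumentFrequency([
    'Best Western Bercy Rive Gauche',
    'Best Western Colisee',
    'Best Western Ducs De Bourgogne',
    'Best Western Folkestone Opera',
    'Best Western France Europe',
    'Best Western Hotel Sydney Opera',
    'Best Western Paris Louvre Opera',
    'Best Western Hotel De Neuville',
]);
// "best" and "western" occur in all 8 names, "opera" in 3, "hotel" and "de" in 2.
```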
You want your results to be more accurate?
From here on in, we're going to have to take a step up to some serious algorithms, so enjoy surfing the many Stack Overflow topics. My instinct is that many hotel names aren't branded at all, so you'll need different categories for them too. My instinct is also that the number of repeated words in hotel names is going to be relatively slim, while some words will be frequent members of hotel names. These facts would be problems for the above. In this case, there's a really popular (if clichéd for SO) technique called k-means. A fun introduction to it would be to extend an algorithm like this (very bravely written in PHP) to take your chosen n keyphrases as the n dimensions of the cluster, then take the majority components of the cluster center-points as your categorization tags. (That would eliminate "France", say, because hits for "France" would be spread across the n-dimensional space pretty evenly.)
This is probably all a bit much to take on for something that would seem like a small problem- but I want to emphasize that if your data isn't structured, there really aren't any short-cuts to doing things properly.

What Levenshtein distance value do you take as the delta between words to be treated as part of the same group? It seems that you tend to group hotels based on the initial few words, and that will require a different approach altogether (like doing a dictionary sort, comparing the current string with the next strings, etc.). However, if your use case still requires calculating the Levenshtein distance, then I would suggest you sort the strings by length and then compare each string with other strings of similar length (apply your own heuristic for what you consider 'similar'; for example, isSimilar = Math.abs(str1.length - str2.length) < SOME_LOWEST_DELTA_VALUE or something like that).

You might want to read about http://en.wikipedia.org/wiki/K-means_clustering and http://en.wikipedia.org/wiki/Cluster_analysis in general.

Identify single/multiple food elements inside a string (user input)

This is my first post after trying to find a solution to my question without luck.
I'll appreciate if you can help me :)
I'm trying to develop a solution where the user enters what they have eaten for breakfast in a textbox, let's say "an orange with toast bread and milk", and my app recognizes or identifies the foods to see how many calories each one has from the following table:
Food - cooked - Calories
Orange cake - oven - 200
Cow Milk - raw - 50
Sheep Milk - raw - 40
Orange - juice - 15
cereal bread - toast - 10
bread - toast - 5
bacon - toast - 10
The solution I've made is a fulltext search for the whole string, without using any explode/implode functions. So the results I get are (from memory, so it's not accurate):
Fulltext rank - Food - cooked - Cal
10,523634 - bacon - toast - 10
5,2342342 - sheep milk - raw - 40
5,2342342 - cow milk - raw - 50
4,2342345 - cereal bread- toast - 10
3,2342344 - orange cake - oven - 200
2,2342342 - orange - juice - 15
$query="
    SELECT Food, cooked,
        MATCH (Food, cooked) AGAINST ('$search') AS score
    FROM food_table
    WHERE MATCH (Food, cooked) AGAINST ('$search')
    ORDER BY score DESC
    LIMIT 50";
I discovered that some scores were the same (sheep milk and cow milk), so I added a new row in MySQL called "milk - average" to be the first result in fulltext, and then I delete the rest of the "same rank" results (I don't have more info from the user, so I just average the calories of the different kinds of milk).
But this is still not very accurate. For example, with orange or others, fulltext gives me a wrong first option, "orange cake - oven", when I wanted just "orange - juice", which matches better (at least it matches one column perfectly). The results also give me multiple options for the same input, and discriminating by score is not enough to let the app "understand" that if something is entered once, it shouldn't produce two results.
Just in case if I explained myself wrongly, the final results I want are:
input:
an orange with toast bread and milk
Solution:
orange - juice - 15
bread - toast - 5
milk - average - 45 (this one, as said, is adding a new mysql row with the data)
Total: 65 calories
I don't need the code (though if you have time it is more than welcome), just the functions I need to use for this purpose, or any other better way to do all of this, and I'll google it to understand.
The second part is to identify the food even if the input has a typo, for example oarnge. I think this is done with the Levenshtein distance, but I'm not sure if I can apply the same solution to the whole problem.
Thanks in advance!!

I think you have some options to solve your problem:
Writing a natural language parser
(NLP on Wikipedia)
You can use some parsing tools (just google nlp php) to map a phrase into a tree, do some part-of-speech tagging and so extract the words you need (maybe with their adjectives, so you can find if and how the food is cooked).
This way can be quite complex.
Limit user input
Only you know how your app is designed, but consider the possibility of changing the way the user can interact with it. You can force the user to click on a "add" button and select from a list of foods.
Somewhere in the middle
If you think that typing is more natural and faster, maybe you can find a compromise between the two above, like asking the user to put commas between the foods and/or implementing some sort of autocompletion.
In this case just some regular expressions can do the job.
For sure there are other paths to follow, like doing statistical nlp or using a dictionary to keep only useful words...
As for typing errors: yes, Levenshtein distance is a widely used technique and you can use it (if you split the phrase in some manner so you have a string comparable to the Food column of your database).
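A minimal sketch of the typo handling with PHP's built-in levenshtein(), assuming the phrase has already been split into single words. The function name closestFood, the food list and the threshold of 2 edits are assumptions of this sketch and would need tuning on real data:

```php
<?php
// Returns the closest known food name, or null if nothing is within $maxDistance edits.
function closestFood(string $word, array $foods, int $maxDistance = 2): ?string
{
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($foods as $food) {
        $distance = levenshtein(strtolower($word), strtolower($food));
        if ($distance < $bestDistance) {
            $best = $food;
            $bestDistance = $distance;
        }
    }
    return $best;
}

$foods = ['orange', 'bread', 'milk', 'bacon'];
$match = closestFood('oarnge', $foods); // "orange": the swapped letters cost 2 edits
$none  = closestFood('pizza', $foods);  // null: no food is within 2 edits
```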

Implementing Bayes classifier (in PHP)

I have a theoretical question about a Naive Bayes Classifier. Assume I have trained the classifier with the following training data:
class  word  count
------------------
pos    good  1
       sun   1
neu    tree  1
neg    bad   1
       sad   1
Assume I now classify "good sun great". There are now two options:
1) Classify against the training data, which remains static. Both "good" and "sun" come from the positive category, so this string is classified as positive. After classification, the training table remains unchanged. All strings are thus classified against the static set of training data.
2) You classify the string, but then update the training data, as in the table underneath. Thus, the next string will be classified against a more "advanced" set of training data than this one. By the end of (automatic) classification, the table that started out as a simple training set, will have grown in size, having been expanded with many words (and updated word counts)
class  word   count
-------------------
pos    good   2
       sun    2
       great  1
neu    tree   1
neg    bad    1
       sad    1
In my Naive Bayes implementation I used the first method, but I'm now wondering whether I should have used the latter. Please enlighten me :-)

The method you've implemented is indeed the popular and accepted way of building classifiers (and not just Bayesian ones).
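For reference, the static approach (option 1 in the question) might look roughly like this. The sketch assumes uniform class priors and add-one (Laplace) smoothing over the training vocabulary, neither of which is stated in the question, and the function name classify is hypothetical:

```php
<?php
// Training counts from the question: class => [word => count].
$training = [
    'pos' => ['good' => 1, 'sun' => 1],
    'neu' => ['tree' => 1],
    'neg' => ['bad' => 1, 'sad' => 1],
];

function classify(string $text, array $training): string
{
    // Vocabulary size, needed for add-one smoothing.
    $vocabulary = [];
    foreach ($training as $counts) {
        $vocabulary += $counts;
    }
    $vocabularySize = count($vocabulary);

    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);

    $bestClass = '';
    $bestScore = -INF;
    foreach ($training as $class => $counts) {
        $total = array_sum($counts);
        // Uniform priors are assumed, so only the log-likelihoods are summed.
        $score = 0.0;
        foreach ($words as $word) {
            $score += log((($counts[$word] ?? 0) + 1) / ($total + $vocabularySize));
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestClass = $class;
        }
    }
    return $bestClass;
}

$label = classify('good sun great', $training); // "pos"
```

The training table is passed by value and never modified, so every string is classified against the same static counts.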
Using "unlabeled" data, i.e. data you have no ground-truth about, to update the classifier, is a more advanced and complicated technique, sometimes called "semi-supervised learning".
Using this class of algorithms might or might not be a good fit to your specific task - it's usually a matter of trial and error.
If you do decide to incorporate unlabeled data into your model, you should probably try out one of the popular algorithms for doing that, e.g. EM (expectation-maximization).
