Given a list of numbers predict the next in sequence - php

I'm using PHP, I have a list of numbers with a min of 1 and a max of 10:
1,2,4,10,4,3,1,6,9,8,2,10,5,6,7,3,1...
Is there a way to find the next logical number in the sequence (or at least the possible numbers)?
I think I can loop through the array and find the one that came up least, but I'm not sure that will work.

You can create a list of functions that test that list of numbers for a specific pattern, yes; however, that is very different from what humans do, which is to "discover" a pattern. Humans also test for patterns they have seen in the past, but we are capable of discovering a pattern we haven't seen before with the algorithms inside our heads. If you want the code to try to discover patterns in your list of numbers, that would be artificial intelligence coding. It very much does exist, though it's a big topic altogether.
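For illustration, here is a minimal sketch of the "list of pattern tests" idea in PHP. The two pattern functions (arithmetic progression and repeating cycle) are just examples, not a definitive set, and a genuinely random sequence will simply match none of them:

<?php
// Each "pattern" is a callable that inspects the sequence and either returns
// a predicted next value or null if the pattern doesn't fit.
$patterns = [
    // Arithmetic progression: constant difference between consecutive terms.
    function (array $seq) {
        if (count($seq) < 3) { return null; }
        $d = $seq[1] - $seq[0];
        for ($i = 2; $i < count($seq); $i++) {
            if ($seq[$i] - $seq[$i - 1] !== $d) { return null; }
        }
        return end($seq) + $d;
    },
    // Fixed-length repeating cycle, e.g. 1,2,3,1,2,3,...
    function (array $seq) {
        $n = count($seq);
        for ($len = 1; $len <= intdiv($n, 2); $len++) {
            $ok = true;
            for ($i = $len; $i < $n; $i++) {
                if ($seq[$i] !== $seq[$i - $len]) { $ok = false; break; }
            }
            if ($ok) { return $seq[$n % $len]; }
        }
        return null;
    },
];

function predictNext(array $seq, array $patterns) {
    foreach ($patterns as $test) {
        $next = $test($seq);
        if ($next !== null) { return $next; }
    }
    return null; // no known pattern matched
}

var_dump(predictNext([1, 3, 5, 7], $patterns));        // int(9)
var_dump(predictNext([1, 2, 4, 10, 4, 3], $patterns)); // NULL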
I hope that explanation helps :)
Edited:
Here's a link if you are interested in knowing more about Artificial Intelligence coding:
https://www.youtube.com/watch?v=TjZBTDzGeGg&list=PLUl4u3cNGP63gFHB6xb-kVBiQHYe_4hSi


Dutch (or German) compound words in search functions (in PHP)

I have been having an issue for a while now with a search function that I'm building for a cooking blog.
In Dutch (similar to German), one can join any number of words together to create a new compound word. This has been giving me a headache when I want search results to include compound words that contain a relevant singular word. It's kind of like a reverse Scunthorpe problem: I actually want to match certain words inside other words, but only sometimes.
For example, the word rice in Dutch is rijst. Brown rice is zilvervliesrijst and pandan rice is pandanrijst. If I want these two to pop up in search results, I have to search whether words exist inside a word, rather than whether they are the word.
However, this immediately causes issues for smaller words that can accidentally occur inside other words. For example, the word for egg is ei, while leek is prei. Onion is ui, while Brussels sprouts are spruitjes. You can see that accepting substring matches against the search string could cause major problems.
I initially tried to grade what percentage of a word the search string makes up, but this also causes issues: prei is 50% ei, while zilvervliesrijst is only about 25% rijst. This also makes using Levenshtein distance to solve this very impractical.
My current solution is as follows: I have an SQL table list of ingredients that are being used to automatically calculate the price and calorie total for each recipe based on the ingredient list, and I have used this to add all relevant synonyms to the name column. Basically, zilvervliesrijst is listed as zilvervliesrijst|rijst. I also use this to add both the plural and singular version of a term such that I will not have to test those.
However, this excludes any compound words in any place other than the ingredient list. Things such as title, cuisine, cooking equipment, dietary preferences and so on are still having this problem.
My question is this: is there a non-library-esque method that addresses this within the field of computer science? Or am I doomed to include every single possible searchable compound word and its singular components every time I want to add a new recipe? I just hope that's not the case, as that would massively increase the processing time required for each additional library entry.
I think it will be hard to do this well without using a library, and probably also a dictionary (which may be bundled as part of the library).
There are really two somewhat orthogonal problems:
Splitting compound words into their constituent parts.
Identifying the stem of a simple (non-compound) word. (For example, removing plural markers and inflections.) This is often called "stemming" but that's not really the best strategy; you'll also find the rather awkward term "lemmatization".
Both of these tasks are plagued with ambiguities in all the languages I know about. (A German example, taken from an Arxiv paper describing the German-language morphological analyser DEMorphy, is "Rohrohrzucker", which means "raw cane sugar" -- Roh Rohr Zucker -- but could equally be split into Rohr Ohr Zucker, pipe-ear sugar, if there were such a thing.)
The basic outline of how these tasks can be done in reasonable time (with lots of CPU power) is:
Use n-gram analysis to figure out plausible word division points.
Lemmatize each candidate component word to get plausible POS (part-of-speech) markers.
Use a trained machine-learning model (or something of that form) to reject non-sensical (or at least highly improbable) divisions.
At each step, check possible corner cases in a dictionary (of corner cases).
That's just a rough outline, of course.
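As a very rough illustration of the splitting step alone, here is a naive dictionary-based splitter in PHP. The word list is a made-up placeholder, and this is a much cruder approach than the n-gram/statistical pipeline outlined above; it only shows the shape of the problem:

<?php
// Naive recursive compound splitter: tries to cover the whole word with
// dictionary entries, preferring longer parts first. $dictionary is a
// hypothetical set of known word parts (keys of the array).
function splitCompound(string $word, array $dictionary, int $minPart = 3): ?array
{
    if ($word === '') {
        return [];
    }
    for ($len = strlen($word); $len >= $minPart; $len--) {
        $head = substr($word, 0, $len);
        if (isset($dictionary[$head])) {
            $rest = splitCompound(substr($word, $len), $dictionary, $minPart);
            if ($rest !== null) {
                return array_merge([$head], $rest);
            }
        }
    }
    return null; // no full cover found
}

$dictionary = array_flip(['zilver', 'vlies', 'rijst', 'pandan']);
var_dump(splitCompound('zilvervliesrijst', $dictionary)); // ["zilver", "vlies", "rijst"]
var_dump(splitCompound('prei', $dictionary));             // NULL, so "prei" is not treated as containing "ei"

Note that the $minPart cutoff and the longest-match-first strategy are exactly the kinds of heuristics that break down on ambiguous cases like the Rohrohrzucker example, which is why real systems add n-gram statistics and a trained model on top.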
I was able to find, without too much trouble, a couple of fairly recent discussions of how to do this with Dutch words. I'm not even vaguely competent to discuss the validity of these papers, so I'll leave you to do the search yourself. (I used the search query "split compound words in Dutch".) But I can tell you two things:
The problem is being worked on, but not necessarily to produce freely-available products.
If you choose to tackle it yourself, you'll end up devoting quite a lot of time to the project, although you might find it interesting. If you do succeed, you'll end up with a useful product and the beginning of a thesis (perhaps useful if you have academic ambitions).
However you choose to do it, you're best off only doing it once for each new recipe. Analyse the contents of each recipe as it is entered, to build a list of search terms which you can store in your database along with the recipe. You will probably also want to split and lemmatize search queries, but those are generally short enough that the CPU time is reasonable. Even so, consider caching the analyses in order to save time on common queries.
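A small sketch of that store-at-insert-time idea, with a hypothetical recipes table and search_terms column (the analyse() callable stands in for whatever splitter/lemmatizer you end up using):

<?php
// When a recipe is saved, precompute its searchable terms once and store them
// alongside the recipe. Table and column names here are assumptions.
function saveRecipeSearchTerms(PDO $db, int $recipeId, array $textFields, callable $analyse): void
{
    $terms = [];
    foreach ($textFields as $text) {
        // $analyse() should return an array of normalized words (split compounds,
        // lemmas, synonyms) for one field such as the title or ingredient list.
        $terms = array_merge($terms, $analyse($text));
    }
    $terms = array_unique($terms);

    $stmt = $db->prepare('UPDATE recipes SET search_terms = :terms WHERE id = :id');
    $stmt->execute([
        ':terms' => implode(' ', $terms),
        ':id'    => $recipeId,
    ]);
}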

Cutting Stock Problem

I'm trying to nest material with the least drop or waste.
Table A
Qty Type Description Length
2 W 16x19 16'
3 W 16x19 12'
5 W 16x19 5'
2 W 5x9 3'
Table B
Type Description StockLength
W 16X19 20'
W 16X19 25'
W 16X19 40'
W 5X9 20'
I've looked all over, reading up on greedy algorithms, bin packing, knapsack, 1D-CSP, branch and bound, brute force, and others. I'm pretty sure this is a cutting stock problem. I just need help coming up with the function(s) to run this. I don't have just one stock length but several, and a user may enter his own inventory of less common lengths. Any help figuring out a function or algorithm to use in PHP to come up with the optimized cutting pattern and the stock lengths needed with the least waste would be greatly appreciated.
Thanks
If your question is "gimme the code", I am afraid that you have not given enough information to implement a good solution. If you read the whole of this answer, you will see why.
If your question is "gimme the algorithm", I am afraid you are looking for an answer in the wrong place. This is a technology-oriented site, not an algorithms-oriented one. Even though we programmers do of course understand algorithms (e.g., why it is inefficient to pass the same string to strlen in every iteration of a loop, or why bubble sort is not okay except for very short lists), most questions here are like "how do I use API X using language/framework Y?".
Answering complex algorithm questions like this one requires a certain kind of expertise (including, but not limited to, lots of mathematical ability). People in the field of operations research have worked on this kind of problem more than most of us ever have. Here is an introductory book on the topic.
As an engineer trying to find a practical solution to a real-world problem, I would first get answers for these questions:
How big is the average problem instance you are trying to solve? Since your generic problem is NP-complete (as Jitamaro already said), moderately big problem instances require the use of heuristics. If you are only going to solve small problem instances, you might be able to get away with implementing an algorithm that finds the exact optimum, but of course you would have to warn your users that they should not use your software to solve big problem instances.
Are there any patterns you could use to reduce the complexity of the problem? For example, do the items always or almost always come in specific sizes or quantities? If so, you could implement a greedy algorithm that focuses on yielding high-quality solutions for common scenarios.
What would be your optimality vs. computational efficiency tradeoff? If you only need a good answer, then you should not waste mental or computational effort in trying to provide an optimal answer. Information, whether provided by a person or by a computer, is only useful if it is available when it is needed.
How much are your customers willing to pay for a high-quality solution? Unlike database or Web programming, which can be done by practically everyone because algorithms are kept to a minimum (e.g. you seldom code the exact procedure by which a SQL database provides the result of a query), operations research does require both mathematical and engineering skills. If you are not charging for them, you are losing money.
This looks to me like a variation of 1D bin packing. You may try a best-fit heuristic and then try it again with different sortings of Table B. Be aware that, because the problem is NP-hard, no polynomial-time algorithm can guarantee results better than 3/2 of the optimum (unless P = NP). Here is a nice tutorial: http://m.developerfusion.com/article/5540/bin-packing. I used it a lot to solve my problem.
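As a starting point, here is a rough first-fit-decreasing sketch in PHP. It is only a heuristic, not an optimal cutting-stock solver, and the array formats (plain numbers for lengths, one material type at a time) are assumptions made for the example:

<?php
// First-fit decreasing heuristic: cut the longest required pieces first,
// placing each on the first stock bar with enough remaining length;
// open a new bar (shortest stock length that fits) otherwise.
function nest(array $pieces, array $stockLengths): array
{
    rsort($pieces);       // longest pieces first
    sort($stockLengths);  // prefer shorter stock when opening a new bar
    $bars = [];           // each bar: ['stock' => length, 'left' => remaining, 'cuts' => [...]]

    foreach ($pieces as $piece) {
        $placed = false;
        foreach ($bars as &$bar) {
            if ($bar['left'] >= $piece) {
                $bar['left'] -= $piece;
                $bar['cuts'][] = $piece;
                $placed = true;
                break;
            }
        }
        unset($bar);
        if (!$placed) {
            foreach ($stockLengths as $stock) {
                if ($stock >= $piece) {
                    $bars[] = ['stock' => $stock, 'left' => $stock - $piece, 'cuts' => [$piece]];
                    $placed = true;
                    break;
                }
            }
        }
        if (!$placed) {
            throw new RuntimeException("No stock length can hold a piece of length $piece");
        }
    }
    return $bars;
}

// Example: the W16x19 demand from Table A against the W16x19 stock from Table B.
print_r(nest([16, 16, 12, 12, 12, 5, 5, 5, 5, 5], [20, 25, 40]));

A real solution would also compare the total waste across several runs (for example, different stock orderings) and keep the best result.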

How can I create a threshold for similar strings using Levenshtein distance and account for typos?

We recently encountered an interesting problem at work where we discovered duplicate user-submitted data in our database. We realized that the Levenshtein distance between most of this data was simply the difference between the two strings in question. That indicates that if we simply add characters from one string into the other, then we end up with the same string, and for most things this seems like the best way for us to account for duplicates.
We also wanted to account for typos. So we started to think about how often, on average, people make typos online per word, and tried to use that figure within this distance. We could not find any such statistic.
Is there any way to account for typos when creating this sort of threshold for a match of data?
Let me know if I can clarify!
First off, Levenshtein distance is defined as the minimum number of edits required to transform string A into string B, where an edit is the insertion or deletion of a single character, or the replacement of a character with another character. So it is very much the "difference between two strings", for a certain definition of distance. =)
It sounds like you're looking for a distance function F(A, B) that gives a distance between strings A and B and a threshold N where strings with distance less than N from each other are candidates for typos. In addition to Levenshtein distance you might also consider Needleman–Wunsch. It's basically the same thing but it lets you provide a function for how close a given character is to another character. You could use that algorithm with a set of weights that reflect the positions of keys on a QWERTY keyboard to do a pretty good job of finding typos. This would have issues with international keyboards though.
If you have k strings and you want to find potential typos, the number of comparisons you need to make is O(k^2). In addition, each comparison is O(len(A)*len(B)). So if you have a million strings you're going to find yourself in trouble if you do things naively. Here are a few suggestions on how to speed things up:
Apologies if this is obvious, but Levenshtein distance is symmetrical, so make sure you aren't computing F(A, B) and F(B, A).
abs(len(A) - len(B)) is a lower bound on the distance between strings A and B. So you can skip checking strings whose lengths are too different.
One issue you might run into is that "1st St." has a pretty high distance from "First Street", even though you probably want to consider those to be identical. The easiest way to handle this is probably to transform strings into a canonical form before doing the comparisons. So you might make all strings lowercase, use a dictionary that maps "1st" to "first", etc. That dictionary might get pretty big, but I don't know a better way to deal with this issue.
Since you tagged this question with php, I'm assuming you want to use PHP for this. PHP has a built-in levenshtein() function, but both strings have to be 255 characters or less. If that's not long enough you'll have to make your own. Alternatively, you could investigate using Python's difflib.
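To make the symmetry and length-bound points concrete, here is a small sketch using PHP's built-in levenshtein(); the threshold value is arbitrary and would need tuning against your data:

<?php
// Find candidate duplicate pairs among $strings: compare each unordered pair
// once, and skip pairs whose length difference already exceeds the threshold
// (the length difference is a lower bound on the Levenshtein distance).
function findLikelyDuplicates(array $strings, int $threshold = 2): array
{
    $pairs = [];
    $n = count($strings);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            if (abs(strlen($strings[$i]) - strlen($strings[$j])) > $threshold) {
                continue;
            }
            if (levenshtein($strings[$i], $strings[$j]) <= $threshold) {
                $pairs[] = [$strings[$i], $strings[$j]];
            }
        }
    }
    return $pairs;
}

print_r(findLikelyDuplicates(['1st st.', '1st st', 'first street', 'main st.']));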
You should check out this book:
http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
It has a good chapter (3.3) on spell checking.
The references at the end of the chapter list some papers that discuss probabilistic models.
Good luck

Check if a name seems "human"?

I have an online RPG game which I'm taking seriously. Lately I've been having problems with users creating bogus characters with bogus names, just a bunch of random letters, like Ghytjrhfsdjfnsdms, Yiiiedawdmnwe, Hhhhhhhhhhejejekk. I force them to change names, but it's becoming too much.
What can I do about this?
Could I at least somehow check that a name can't use more than two of the same letter next to each other? And maybe also check whether it contains vowels?
I would recommend concentrating your energy on building a user interface that makes it brain-dead easy to list all new names for an administrator, plus a big fat "force to rename" mechanism that minimizes the admin's workload, rather than trying to define the incredibly complex and varied rules that make a name (and program a regular expression to match them!).
Update - one thing comes to mind, though: Second Life used to allow you to freely specify a first name (maybe they check against a database of first names, I don't know) and then gives you a selection of a few hundred pre-defined last names to choose from. For an online RPG, that may already be enough.
You could use a metaphone implementation and then look for "unnatural" patterns:
http://www.php.net/manual/en/function.metaphone.php
This is the PHP function for metaphone string generation. You pass in a string and it returns a phonetic representation of the text. You could, in theory, pass in a large number of "human" names and store a database of valid combinations of phonemes. To test a questionable name, just see if its combinations of phonemes are in the database.
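A simplified sketch of that idea (checking whole metaphone keys rather than phoneme combinations, with a tiny placeholder name list standing in for the large database the answer describes):

<?php
// Collect metaphone keys for known "human" names, then flag candidate names
// whose phonetic key was never seen. $knownNames is just a placeholder.
$knownNames = ['John', 'Johan', 'Maria', 'Sophie', 'Erik', 'Thomas'];

$knownKeys = [];
foreach ($knownNames as $name) {
    $knownKeys[metaphone($name)] = true;
}

function looksHuman(string $candidate, array $knownKeys): bool
{
    return isset($knownKeys[metaphone($candidate)]);
}

var_dump(looksHuman('Jon', $knownKeys));                // true if it shares a phonetic key with a known name
var_dump(looksHuman('Ghytjrhfsdjfnsdms', $knownKeys));  // a random mash of letters is very unlikely to match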
Hope this helps!
Would limiting the number of consonants or vowels in a row, and preventing repetition, help?
As a regex:
// reject names with four or more consonants in a row, four or more vowels
// in a row, or the same letter three or more times in a row
if (preg_match('/[bcdfghjklmnpqrtsvwxyz]{4}|[aeiou]{4}|([a-z])\1{2}/i', $name)) {
    //reject
}
Possibly use iconv with ASCII//TRANSLIT if you allow accentuated characters.
What if you used the Google Search API to see whether the name returns any results?
I say take @Unicron's approach of easy admin rejection, but on each rejection, add the name to a database of banned names. You might be able to use this data to detect specific attacks that generate large numbers of users based on patterns. It will of course be very difficult to detect one-offs.
I had this issue as well. An easy way to solve it is to force user names to validate against a database of world-wide names. Essentially you have a database on the backend with a few hundred thousand first and last names for both genders, and make their name match.
With a little bit of searching on google, you can find many name databases.
Could I at least somehow check that a name can't use more than two of the same letter next to each other? And maybe also check whether it contains vowels?
If you just want this, you can do:
preg_match('/(.)\\1\\1/i', $name);
This will return 1 if anything appears three times in a row or more.
This link might help. You might also be able to run the name through a (possibly modified) speech synthesiser engine and analyse how much trouble it's having generating the speech, without actually generating it.
You should try implementing a modified version of a Naive Bayes spam filter. For example, in normal spam detection you calculate the probability of a word being spam and use individual word probabilities to determine if the whole message is spam.
Similarly, you could download a word list, and compute the probability that a pair of letters belongs to a real word.
E.g., create a 26x26 table, say T. Let the 5th row represent the letter e, and let entry T(5,1) be the number of times "ea" appeared in your word list. Once you're done counting, divide each element in each row by the sum of the row, so that T(5,1) is now the percentage of times "ea" appears in your word list among letter pairs starting with e.
Now you can use the individual pair probabilities (e.g. in "Jimy" that would be {Ji, im, my}) to check whether "Jimy" is an acceptable name or not. You'll probably have to determine the right probability threshold, but try it out; it's not that hard to implement.
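A minimal PHP sketch of that letter-pair table (using an associative array instead of a 26x26 matrix, with a tiny made-up training list; a real implementation would train on a full word list and tune the threshold):

<?php
// Train letter-pair (bigram) frequencies from a word list, then score a
// candidate name by the average probability of its letter pairs.
function trainBigrams(array $words): array
{
    $counts = [];
    foreach ($words as $word) {
        $word = strtolower($word);
        for ($i = 0; $i < strlen($word) - 1; $i++) {
            $pair = substr($word, $i, 2);
            $counts[$pair] = ($counts[$pair] ?? 0) + 1;
        }
    }
    $total = array_sum($counts);
    return array_map(fn ($c) => $c / $total, $counts);
}

function nameScore(string $name, array $probs): float
{
    $name = strtolower($name);
    $pairs = max(1, strlen($name) - 1);
    $score = 0.0;
    for ($i = 0; $i < strlen($name) - 1; $i++) {
        $score += $probs[substr($name, $i, 2)] ?? 0.0; // unseen pairs count as zero
    }
    return $score / $pairs;
}

$probs = trainBigrams(['anna', 'john', 'maria', 'peter', 'susan', 'kevin']);
// Accept the name only if its score clears some tuned threshold (0.01 is arbitrary here).
var_dump(nameScore('Jonna', $probs) > 0.01);              // bool(true)
var_dump(nameScore('Hhhhhhhhhhejejekk', $probs) > 0.01);  // bool(false)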
What do you think about delegating the responsibility of creating users to a third party source (like Facebook, Twitter, OpenId...)?
Doing that will not solve your problem, but it will be more work for a user to create additional accounts - which (assuming that the users are lazy, since most are) should discourage the creation of additional "dummy" users.
It seems as though you are going to need a fairly complex preg function. I don't want to take the time to write one for you, as you will learn more writing it yourself, but I will help along the way if you post some attempts.
http://php.net/manual/en/function.preg-match.php

PHP - How to suggest terms for search, "did you mean...?"

When searching the db with terms that retrieve no results, I want to offer a "did you mean..." suggestion (like Google).
So for example if someone looks for "jquyer", it would output "did you mean jquery?"
Of course, suggestion results have to be matched against the values inside the db (I'm using MySQL).
Do you know a library that can do this? I've googled this but haven't found any great results.
Or perhaps you have an idea how to construct this on my own?
A quick and easy solution involves SOUNDEX or SOUNDEX-like functions.
In a nutshell, the SOUNDEX function was originally used to deal with common typos and alternate spellings of family names, and it encapsulates very well many common spelling mistakes (in the English language). Because of its focus on family names, the original Soundex function may be limiting (for example, encoding stops after the third or fourth non-repeating consonant letter), but it is easy to extend the algorithm.
The interest of this type of function is that it allows computing, ahead of time, a single value which can be associated with the word. This is unlike string distance functions such as edit distance functions (such as Levenshtein, Hamming or even Ratcliff/Obershelp) which provide a value relative to a pair of strings.
By pre-computing and indexing the SOUNDEX value for all words in the dictionary, one can, at run-time, quickly search the dictionary/database based on the [run-time] calculated SOUNDEX value of the user-supplied search terms. This Soundex search can be done systematically, as complement to the plain keyword search, or only performed when the keyword search didn't yield a satisfactory number of records, hence providing the hint that maybe the user-supplied keyword(s) is (are) misspelled.
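For example, with MySQL's built-in SOUNDEX() function, the precomputed-key lookup might look like the sketch below; the table and column names (dictionary, word, soundex_key) are assumptions made for the example:

<?php
// One-time preparation (run in MySQL):
//   ALTER TABLE dictionary ADD COLUMN soundex_key VARCHAR(32);
//   UPDATE dictionary SET soundex_key = SOUNDEX(word);
//   CREATE INDEX idx_soundex ON dictionary (soundex_key);

// At query time, look up suggestions by the key of the possibly misspelled term.
function suggest(PDO $db, string $term): array
{
    $stmt = $db->prepare(
        'SELECT word FROM dictionary WHERE soundex_key = SOUNDEX(:term) LIMIT 10'
    );
    $stmt->execute([':term' => $term]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// suggest($db, 'jquyer') should return 'jquery' if that word is in the dictionary
// table, since the two spellings map to the same Soundex key.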
A totally different approach, only applicable to user queries which include several words, is based on running multiple queries against the dictionary/database, excluding one (or several) of the user-supplied keywords. These alternate queries' result lists provide a list of distinct words; this [reduced] list of words is typically small enough that pair-based distance functions can be applied to select, within the list, the words which are closest to the allegedly misspelled word(s). The word frequency (within the result lists) can be used both to limit the number of words (only evaluate similarity for the words which are found more than x times) and to provide weight, to slightly skew the similarity measurements (i.e. favoring words found "in quantity" in the database, even if their similarity measurement is slightly lower).
How about the levenshtein function, or similar_text function?
Actually, I believe Google's "did you mean" function is generated by what users type in after they've made a typo. However, that's obviously a lot easier for them since they have unbelievable amounts of data.
You could use Levenshtein distance as mgroves suggested (or Soundex), but store results in a database. Or, run separate scripts based on common misspellings and your most popular misspelled search terms.
http://www.phpclasses.org/browse/package/4859.html
Here's an off-the-shelf class that's rather easy to implement, which employs minimum edit distance. All you need to do is have a token (not type) list of all the words you want to work with handy. My suggestion is to make sure it's the complete list of words within your search index, and only within your search index. This helps in two ways:
Domain specificity helps avoid misleading probabilities from overtaking your implementation
Ex: "Memoize" may be spell-corrected to "Memorize" for most off-the-shelf, dictionaries, but that's a perfectly good search term for a computer science page.
Proper nouns that are available within your search index are now accounted for.
Ex: If you're Dell, and someone searches for 'inspiran', there's absolutely no chance the spell-correct function will know you mean 'inspiron'. It will probably spell-correct to 'inspiring' or something more common, and, again, less domain-specific.
When I did this a couple of years ago, I already had a custom built index of words that the search engine used. I studied what kinds of errors people made the most (based on logs) and sorted the suggestions based on how common the mistake was.
If someone searched for jQuery, I would build a select-statement that went
SELECT Word, 1 AS Relevance
FROM keywords
WHERE Word IN ('qjuery','juqery','jqeury' etc)
UNION
SELECT Word, 2 AS Relevance
FROM keywords
WHERE Word LIKE 'j_query' OR Word LIKE 'jq_uery' etc etc
ORDER BY Relevance, Word
The resulting words were my suggestions and it worked really well.
You should keep track of common misspellings that come through your search (or generate some yourself with a typo generator) and store each misspelling and the word it matches in a database. Then, when a search has no matching results, you can check the misspelling table and use the suggested word.
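A small sketch of that misspelling-table lookup (the table and column names, misspellings, misspelling and correct_word, are hypothetical):

<?php
// When a search returns nothing, look the term up in a table mapping known
// misspellings to the intended word.
function didYouMean(PDO $db, string $term): ?string
{
    $stmt = $db->prepare(
        'SELECT correct_word FROM misspellings WHERE misspelling = :term LIMIT 1'
    );
    $stmt->execute([':term' => strtolower($term)]);
    $word = $stmt->fetchColumn();
    return $word === false ? null : $word;
}

// Usage:
// if (count($results) === 0 && ($hint = didYouMean($db, $query))) {
//     echo "Did you mean {$hint}?";
// }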
Writing your own custom solution will take quite some time and is not guaranteed to work if your dataset isn't big enough, so I'd recommend using an API from a search giant such as Yahoo. Yahoo's results aren't as good as Google's but I'm not sure whether Google's is meant to be public.
You can simply use an API like this one: https://www.mashape.com/marrouchi/did-you-mean
