Intersection of two regular expressions - php

I'm looking for a function (PHP would be best) which returns true if there exists a string that matches both regexpA and regexpB.
Example 1:
$regexpA = '[0-9]+';
$regexpB = '[0-9]{2,3}';
hasRegularsIntersection($regexpA,$regexpB) returns TRUE because '12' matches both regexps
Example 2:
$regexpA = '[0-9]+';
$regexpB = '[a-z]+';
hasRegularsIntersection($regexpA,$regexpB) returns FALSE because digits never match letters.
Thanks for any suggestions on how to solve this.
Henry

For regular expressions that are actually regular (i.e. don't use irregular features like back references) you can do the following:
Transform the regexen into finite automata (an algorithm for that can be found here (chapter 9), for example).
Build the intersection of the automata. (You have a state for each pair in the Cartesian product of the states of the two automata. You then transition between these states according to the original automata's transition rules: e.g. if you're in state x1y2 and you get the input a, and the first automaton has a transition x1->x4 for input a while the second automaton has y2->y3, you transition into the state x4y3.)
Check whether there's a path from the start state to an accepting state in the new automaton. If there is, the two regexen intersect; otherwise they don't.
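A minimal PHP sketch of steps 2 and 3 (product construction plus reachability), assuming the regexes have already been compiled to deterministic automata in a hypothetical array format with 'start', 'accept' and 'delta' keys; the regex-to-automaton step itself is what the linked chapter covers:

<?php
// Each automaton: ['start' => state, 'accept' => [states...],
//                  'delta' => [state => [symbol => state]]]
// Returns true if some string is accepted by both automata,
// i.e. the intersection language is non-empty.
function hasRegularsIntersection(array $a, array $b, array $alphabet): bool
{
    $queue = [[$a['start'], $b['start']]];
    $seen  = [$a['start'] . '|' . $b['start'] => true];

    while ($queue) {
        [$x, $y] = array_shift($queue);
        // A product state is accepting iff both components accept.
        if (in_array($x, $a['accept'], true) && in_array($y, $b['accept'], true)) {
            return true;
        }
        foreach ($alphabet as $c) {
            if (!isset($a['delta'][$x][$c], $b['delta'][$y][$c])) {
                continue; // one side has no transition: dead end
            }
            $next = [$a['delta'][$x][$c], $b['delta'][$y][$c]];
            $key  = $next[0] . '|' . $next[1];
            if (!isset($seen[$key])) {
                $seen[$key] = true;
                $queue[] = $next;
            }
        }
    }
    return false; // no accepting product state is reachable
}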

Theory.
Java library.
Usage:
/**
 * @return true if the two regexes will never both match a given string
 */
public boolean isRegexOrthogonal(String regex1, String regex2) {
    // compile each regex to an automaton
    Automaton automaton1 = new RegExp(regex1).toAutomaton();
    Automaton automaton2 = new RegExp(regex2).toAutomaton();
    // an empty intersection means no string matches both
    return automaton1.intersection(automaton2).isEmpty();
}

A regular expression specifies a finite state machine that can recognize a potentially infinite set of strings. The set of strings may be infinite but the number of states must be finite, so you can examine the states one by one.
Taking your second example: In the first expression, to get from state 0 to state 1, the string must start with a digit. In the second expression, to get from state 0 to state 1, the string must start with a letter. So you know that there is no string that will get you from state 0 to state 1 in BOTH expressions. You have the answer.
Taking the first example: You know that if the string starts with a digit you can get from state 0 to state 1 with either regular expression. So now you can eliminate state 0 for each, and just answer the question for each of the two (now one state smaller) finite-state-machines.
This looks a lot like the well-known "halting problem", which as you know is unsolvable in the general case for a Turing machine or equivalent. But in fact the halting problem IS solvable for a finite-state machine, simply because there are a finite number of states.
I believe you could solve this with a non-deterministic FSM. If your regex had only one transition from each state to the next, a deterministic FSM could solve it. But a regular expression means that, for instance, if you are in state 2, then if the character is a digit you go to state 3, else if the character is a letter you go to state 4.
So here's what I would do:
Solve it for the subset of FSMs that have only one transition from one state to the next. For instance, a regex that matches both "Bob" and "bob", and a second regex that matches only "bob" and "boB".
See if you can implement the solution in a finite state machine. I think this should be possible. The input to a state is a pair representing a transition for one FSM and a transition for the second one. For instance: State 0: If (r1, r2) is (("B" or "b"), "b") then State 1. State 1: If (r1, r2) is (("o"), ("o")) then state 2. etc.
Now for the more general case, where for instance state two goes back to state two or an earlier state; for example, regex 1 recognizes only "meet" but regex 2 recognizes "meeeet" with an unlimited number of e's. You would have to reduce them to regex 1 recognizing "t" and regex 2 recognizing "t". I think this may be solvable by a non-deterministic FSM.
That's my intuition anyway.
I don't think it's NP-complete, just because my intuition tells me you should be able to shorten each regex by one state with each step.

It is possible. I encountered it once with the Pellet OWL reasoner while learning semantic web technologies.
Here is an example that shows how regular expressions can be parsed into a tree structure. You could then (in theory) parse your two regular expressions into trees and see if one tree is a subset of the other, i.e. if one tree can be found within the other tree's nodes.
If it is found, then the other regular expression will match (not only, but also) a subset of what the first regular expression will match.
It is not much of a solution, but maybe it'll help you.

Related

Finding out sequence similarity in arrays

I have a task where I have three arrays A, B, C, all of which contain the same data. For the sake of simplicity, let's assume the data is the numbers 1 to 5. The data would be in differently jumbled sequences. I want to find out which of B and C has data most similar to A.
Eg:
A = 1,2,3,4,5
B = 1,2,3,5,4
C = 4,1,2,3,5
In this case, it is easy to visually comprehend that B is more similar to A. But it gets more complicated for really jumbled sequences.
Eg:
A = 1,2,3,4,5
B = 5,3,1,4,2
C = 4,1,2,3,5
In this case, I would assume C to be closer to A. I am thinking that this assumption can be quantified as: how many elements have the same sequence in both arrays? In the above example, the subsequence [1,2,3] is the same in both arrays. The second question would be: what is the offset difference between the similar subsequences? In this case it is 1, because the subsequence begins at index 0 in A and index 1 in C.
So the number of elements in a matching sequence and their offsets are what I am thinking of using. I plan on adding weights to these two quantities (the number of elements in the matching sequence, and the offset difference in their occurrence).
Does this make sense? I only need a rough approximation of similarity and the results do not need to be exact. Are there any formal mathematical or data-structure models that solve this problem?
BTW, the project where I need this implemented is in PHP. Does it have any built-in functions for this, like levenshtein() for string difference?
Any suggestions are very welcome!
Well, I suppose you could come up with your own algorithm (for instance, generate all suffixes, search for them, and define a scoring procedure), or you could use a well-known algorithm like
Smith-Waterman for local alignment or Needleman-Wunsch for global alignment. The advantage of these algorithms is that they are well understood and give you all the possible alignments (so you can choose the best one for your case).
NW in PHP
SW in PHP
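If the elements map to single characters (as with 1-5 here), PHP's built-in levenshtein() gives a quick, rough similarity score without implementing a full alignment algorithm; a minimal sketch under that assumption:

<?php
// Rough similarity via edit distance, assuming every element can be
// mapped to a single character (true for values 1..5).
function closerToA(array $a, array $b, array $c): string
{
    $dB = levenshtein(implode('', $a), implode('', $b)); // edits to turn A into B
    $dC = levenshtein(implode('', $a), implode('', $c)); // edits to turn A into C
    return $dB <= $dC ? 'B' : 'C';
}

echo closerToA([1,2,3,4,5], [5,3,1,4,2], [4,1,2,3,5]); // C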

From 6 random numbers calculate random three-digit number?

I have 4 years of PHP and C# experience, but math is not my strong side.
I think I need to use some math algorithms in this project.
When the page loads I need to randomly create 7 numbers: 6 numbers that I can use to calculate the given three-digit number:
rand 1-9
rand 1-9
rand 1-9
rand 1-9
rand 10-100 //5 steps
rand 10-100 //5 steps
and the given number to calculate is 100-999.
I can use these operations: +, -, /, *, (, )
What is best algorithm for this?
I probably need to try all possible combinations of these 6 numbers to find the given number, or the closest achievable result.
example:
let's say that the given three-digit number is
350, and I need to calculate this number from these numbers:
3, 6, 9, 5, 10, 100
so formula for this is:
(100*3)+(5*10) = 350
If it is not possible to calculate the exact number, then calculate the closest.
You don't need to solve this problem completely; you can point me in the right direction by pasting some pseudocode or describing how to do it.
I have no actual experience that might help you with this, though since you're asking for some insight, I'll share my thoughts on how to do this.
As I typed my answer, I realised that this is in fact a knapsack problem, which means you can solve it to optimality using any algorithm that solves the knapsack problem. I recommend using dynamic programming to make your program run faster.
What you need to do is construct all numbers you can generate by combining two numbers with an operator, so that after this you have a list containing the numbers you started with, and the numbers you generated.
Then you solve the knapsack problem using the numbers as items with their value as their weight, and the number as the weight you can store at most.
The only thing that is slightly different is that you have an extra constraint saying that you may only use each number once. So you need to add to your implementation that if you store a combination of numbers, you must remove the option of storing another combination constructed from the same number.
You could enumerate all the solutions by building "abstract syntax trees": binary trees with the following information:
the leaves are the 6 numbers
the nodes are the operations; for example, a node '+' with the leaf '7' as left son, and as right son another node 'x' with '140' as left son and '8' as right son, would represent (7+(140*8)). Additionally, at each node you store the numbers you have already used (the leaves used in the tree) and the total.
Let's say you store all the constructed trees in an associative map TreeSets, indexed by the number of leaves they use. For example, the tree (7+(140*8)) would not be stored directly in TreeSets but in TreeSets[3] (TreeSets[3] is itself a set containing several trees).
You store the closest score in BestScore and one solution achieving BestScore in BestSolution.
You start by constructing the 6 leaves (that gives you 6 different trees, each consisting of only one leaf). You save the closest number in BestScore and the corresponding leaf in BestSolution.
Then at each step, you try to construct the trees with i leaves, i from 2 to 6, and store them in TreeSets[i].
You take j from 1 to i-1. For each tree in TreeSets[j] and each tree in TreeSets[i-j], you check that the two trees don't use the same leaves (you don't have to check at the bottom of the trees, since the leaves used are stored in each node). If they don't, you build the four nodes '+', 'x', '/', '-' with the tree from TreeSets[j] as left son and the tree from TreeSets[i-j] as right son, and store all four of them in TreeSets[i]. While building a node, you take the totals from both trees and apply the operation; you store the result as the node's total and check whether it is closer than BestScore (if so, you update BestScore and BestSolution with this new total and the new node). If the total is exactly the value you were looking for, you can stop there.
If the program didn't stop by finding an exact solution, there is no such solution, and the closest one is in BestSolution at the end.
Note: You don't have to build a complete tree each time; just build the node with two pointers to the other trees.
P.S.: You may avoid enumerating all the solutions by using the dynamic programming approach, as Glubus said. In this case it would consist, at each step (i), of removing some solutions that are considered sub-optimal. But with this problem I'm not sure that is possible (except maybe removing the nodes with a total of 0).
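A brute-force PHP sketch of this enumeration: instead of materializing trees, it repeatedly replaces two numbers with the result of an operation (tracking a printable expression alongside each value) until one number remains. Division is only taken when it stays integral; the pruning/memoization discussed above is left out:

<?php
// Brute force over the trees described above. $exprs holds a printable
// expression for each entry of $nums; $best tracks the closest result.
function search(array $nums, array $exprs, int $target, array &$best): void
{
    // Any single remaining number is itself a candidate result.
    foreach ($nums as $k => $v) {
        if (abs($v - $target) < abs($best['val'] - $target)) {
            $best = ['val' => $v, 'expr' => $exprs[$k]];
        }
    }
    if ($best['val'] === $target) {
        return; // exact hit: stop searching
    }
    $n = count($nums);
    for ($i = 0; $i < $n - 1; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            [$a, $b]   = [$nums[$i], $nums[$j]];
            [$ea, $eb] = [$exprs[$i], $exprs[$j]];

            $candidates = [[$a + $b, "($ea+$eb)"], [$a * $b, "($ea*$eb)"],
                           [$a - $b, "($ea-$eb)"], [$b - $a, "($eb-$ea)"]];
            if ($b !== 0 && $a % $b === 0) { $candidates[] = [intdiv($a, $b), "($ea/$eb)"]; }
            if ($a !== 0 && $b % $a === 0) { $candidates[] = [intdiv($b, $a), "($eb/$ea)"]; }

            // Numbers left over once $nums[$i] and $nums[$j] are consumed.
            $rest = $restE = [];
            foreach ($nums as $k => $v) {
                if ($k !== $i && $k !== $j) { $rest[] = $v; $restE[] = $exprs[$k]; }
            }

            foreach ($candidates as [$val, $expr]) {
                search(array_merge($rest, [$val]), array_merge($restE, [$expr]),
                       $target, $best);
            }
        }
    }
}

$nums = [3, 6, 9, 5, 10, 100];
$best = ['val' => PHP_INT_MAX, 'expr' => ''];
search($nums, array_map('strval', $nums), 350, $best);
echo $best['expr'], ' = ', $best['val']; // e.g. ((3*100)+(5*10)) = 350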

Calculate the max length of a regex output string

A user can define the format of an identifier in my system, and this is stored in the d/b as a regex string (for example, "/^\d{6}$/", or a more complicated example of "/^[A-Z]{2}\d{8}$/").
Can anyone suggest how I can calculate the maximum length of the string that the given regex can match (thanks #Ulver)?
Many thanks for reading!
This answer assumes 5 things:
The expressions are simple, as per your examples.
You do not have * or + operators in your expression.
You do not have patterns of the type foo{n, }, where n is some positive, integer value.
Each expression starts with ^ and ends with $.
I am also assuming that each term is followed by the amount of times you expect to match it.
To calculate the amount of characters they match, you could go through the expression and look for 2 patterns:
{n}, which translates to match exactly n times. In this case, extract n.
{n, m}, which translates to match at least n times, and at most m times. In this case, extract m.
Once you have all the n and m values, you simply add them together.
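Under these assumptions, a minimal PHP sketch that extracts the quantifiers and sums the upper bounds (the function name is mine):

<?php
// Sum the upper bounds of all {n} / {n,m} quantifiers in a simple
// anchored pattern (per the assumptions above: every term carries one).
function maxMatchLength(string $pattern): int
{
    preg_match_all('/\{(\d+)(?:,\s*(\d+))?\}/', $pattern, $m, PREG_SET_ORDER);
    $len = 0;
    foreach ($m as $q) {
        $len += (int) ($q[2] ?? $q[1]); // m from {n,m}, otherwise n from {n}
    }
    return $len;
}

echo maxMatchLength('/^\d{6}$/');         // 6
echo maxMatchLength('/^[A-Z]{2}\d{8}$/'); // 10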
Some more details on the assumptions:
As expressions get more complicated, you will need to keep track of various constructs. For instance, ^[A-Z]{2}$ means match 2 upper-case letters, so the length of what is matched will be 2. On the other hand, foo{2} means fooo, but afooo and foooobar will also be matched, so you have no control over the length of the matched string. Also, (abc){2} means match abc twice; in this case you would need to multiply the value of n (the value in the braces) by the length of whatever lies within the parentheses that precede it, if any. Of course, you could have nested values.
The * and + operators denote 0 or more and 1 or more, respectively. Thus, there is theoretically no limit on the length of whatever is matched.
Similar to point 2, {n,} means match at least n times. Thus, there is no upper limit.
Similar to point 1, without the ^ and $ anchors an expression can match inside any string. The expression foo can match afoo, foobar, foooooooooooooooooooooooo and so on.
I made this assumption for reasons similar to point 1. You could enhance your application to look for [] pairs and count them as 1 character, but I think you could have other caveats.

PHP - How should GPS LOCATION Regular expression look like?

I'm gonna have to learn to use regular expressions soon, but right now I just need to know what a check for the "50.080215,14.393983" GPS format should look like. Thanks, Mike.
You want to find two decimal numbers separated by a comma (and maybe whitespace?) in a string?
$pattern = "/(?P<lat>(\d+(\.\d+)?)),\s*(?P<lon>(\d+(\.\d+)?))/";
This assumes that the fractional portion of each number may be absent, and it places no constraints on the number of digits of precision. Depending on your input corpus this may match more often than you want. With a better specification a tighter pattern could be constructed. For example, if we specify that latitude runs from -90 to 90 inclusive and longitude from -180 to 180 inclusive, and both may have up to 6 digits of precision, we can construct this pattern:
$pattern = "/(?P<lat>-?([1-8]?[1-9]|[1-9]0)(\.\d{1,6})?),\s*(?P<lon>-?(1?[1-7][1-9]|1?[1-8]0|[1-9]?[0-9])(\.\d{1,6})?)/";
There is a slight bug in this specification in that it will match "90.999999,180.999999", which is outside the hypothetical spec. Correcting this is left as an exercise for the reader.
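A quick usage sketch with the first pattern; preg_match exposes the named groups in the match array:

<?php
$pattern = "/(?P<lat>(\d+(\.\d+)?)),\s*(?P<lon>(\d+(\.\d+)?))/";

if (preg_match($pattern, "50.080215,14.393983", $m)) {
    echo "lat = {$m['lat']}, lon = {$m['lon']}";
    // prints: lat = 50.080215, lon = 14.393983
}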

LSA - Latent Semantic Analysis - How to code it in PHP?

I would like to implement Latent Semantic Analysis (LSA) in PHP in order to find out topics/tags for texts.
Here is what I think I have to do. Is this correct? How can I code it in PHP? How do I determine which words to choose?
I don't want to use any external libraries. I already have an implementation of Singular Value Decomposition (SVD).
Extract all words from the given text.
Weight the words/phrases, e.g. with tf–idf. If weighting is too complex, just take the number of occurrences.
Build up a matrix: The columns are some documents from the database (the more the better?), the rows are all unique words, the values are the numbers of occurrences or the weight.
Do the Singular Value Decomposition (SVD).
Use the values in the matrix S (SVD) to do the dimension reduction (how?).
I hope you can help me. Thank you very much in advance!
LSA links:
Landauer (co-creator) article on LSA
the R-project lsa user guide
Here is the complete algorithm. If you have SVD, you are most of the way there. The papers above explain it better than I do.
Assumptions:
your SVD function will give the singular values and singular vectors in descending order. If not, you have to do more acrobatics.
M: corpus matrix, w (words) by d (documents) (w rows, d columns). These can be raw counts, or tfidf or whatever. Stopwords may or may not be eliminated, and stemming may happen (Landauer says keep stopwords and don't stem, but yes to tfidf).
U,Sigma,V = singular_value_decomposition(M)
U: w x w
Sigma: a vector of length min(w,d), or a w x d matrix whose diagonal is filled in the first min(w,d) spots with the singular values
V: d x d matrix
Thus U * Sigma * V = M
# you might have to do some transposes depending on how your SVD code
# returns U and V. verify this so that you don't go crazy :)
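Building M itself in PHP might look like this rough sketch (raw counts only, a deliberately crude tokenizer, no stopword handling; the helper name is mine):

<?php
// Rough sketch of building M: w rows (unique words) x d columns
// (documents), filled with raw counts.
function buildCorpusMatrix(array $documents): array
{
    $counts = [];
    foreach ($documents as $d => $text) {
        preg_match_all('/[a-z]{2,}/', strtolower($text), $match);
        $counts[$d] = array_count_values($match[0]); // word => occurrences
    }
    $vocab = array_keys(array_merge(...array_values($counts))); // all unique words

    $M = [];
    foreach ($vocab as $w => $word) {
        foreach ($counts as $d => $c) {
            $M[$w][$d] = $c[$word] ?? 0;
        }
    }
    return [$vocab, $M]; // row labels and the matrix itself
}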
Then the reduction step: the actual LSA paper suggests that a good approximation for the basis is to keep enough vectors such that their singular values are more than 50% of the total of the singular values.
More succinctly (pseudocode):
s1 = sum(Sigma)
total = 0
for ii in range(len(Sigma)):
    val = Sigma[ii]
    total += val
    if total > .5 * s1:
        return ii
This will return the rank of the new basis, which was min(d,w) before, and we'll now approximate with {ii}.
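The same heuristic in PHP might look like the sketch below. Note it returns a count rather than a loop index (keeping indices 0..ii means ii+1 values):

<?php
// Keep enough singular values to pass 50% of their total.
function reducedRank(array $sigma): int
{
    $threshold = 0.5 * array_sum($sigma);
    $total = 0.0;
    foreach ($sigma as $ii => $val) {
        $total += $val;
        if ($total > $threshold) {
            return $ii + 1; // keep singular values 0..ii
        }
    }
    return count($sigma); // degenerate case: keep everything
}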
(here, ' -> prime, not transpose)
We create new matrices: U',Sigma', V', with sizes w x ii, ii x ii, and ii x d.
That's the essence of the LSA algorithm.
This resultant matrix U' * Sigma' * V' can be used for 'improved' cosine-similarity searching, or you can pick the top 3 words for each document in it, for example. Whether this yields more than simple tf-idf is a matter of some debate.
To me, LSA performs poorly on real-world data sets because of polysemy and data sets with too many topics. Its mathematical/probabilistic basis is unsound (it assumes normal-ish (Gaussian) distributions, which don't make sense for word counts).
Your mileage will definitely vary.
Tagging using LSA (one method!)
Construct the U' Sigma' V' dimensionally reduced matrices using SVD and a reduction heuristic
By hand, look over the U' matrix and come up with terms that describe each "topic". For example, if the biggest components of that vector were "Bronx, Yankees, Manhattan", then "New York City" might be a good term for it. Keep these in an associative array or list. This step should be reasonable, since the number of vectors will be finite.
Assuming you have a vector (v1) of words for a document, then v1 * t(U') will give the strongest 'topics' for that document. Select the 3 highest, then give their "topics" as computed in the previous step.
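A sketch of that projection in PHP, assuming U' is stored row-major as $Uprime[word][topic] (so the t(U') transpose in the text is handled by how we index):

<?php
// Score each reduced dimension ("topic") for a document: the dot product
// of the document's word-count vector with each column of U'.
// $v1:     [wordIndex => count], length w
// $Uprime: [wordIndex => [topicIndex => coordinate]], w x ii
function documentTopics(array $v1, array $Uprime): array
{
    $scores = [];
    foreach ($Uprime[0] as $t => $_) {   // one score per topic column
        $s = 0.0;
        foreach ($v1 as $w => $count) {
            $s += $count * $Uprime[$w][$t];
        }
        $scores[$t] = $s;
    }
    arsort($scores);                     // strongest topics first
    return $scores;                      // array_slice($scores, 0, 3, true) gives the top 3
}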
This answer isn't directed at the poster's question, but at the meta question of how to autotag news items. The OP mentions Named Entity Recognition, but I believe they mean something more along the lines of autotagging. If they really mean NER, then this response is hogwash :)
Given these constraints (600 items / day, 100-200 characters / item) with divergent sources, here are some tagging options:
By hand. An analyst could easily do 600 of these per day, probably in a couple of hours. Something like Amazon's Mechanical Turk, or making users do it, might also be feasible. Having some number of hand-tagged items, even if it's only 50 or 100, will be a good basis for comparing whatever the auto-generated methods below get you.
Dimensionality reduction, using LSA, topic models (Latent Dirichlet Allocation), and the like... I've had really poor luck with LSA on real-world data sets, and I'm unsatisfied with its statistical basis. I find LDA much better; it has an incredible mailing list with the best thinking on how to assign topics to texts.
Simple heuristics... if you have actual news items, then exploit the structure of the news item. Focus on the first sentence, toss out all the common words (stop words), and select the best 3 nouns from the first two sentences. Or heck, take all the nouns in the first sentence and see where that gets you. If the texts are all in English, then do part-of-speech analysis on the whole shebang and see what that gets you. With structured items like news reports, LSA and other order-independent methods (tf-idf) throw out a lot of information.
Good luck!
(if you like this answer, maybe retag the question to fit it)
That all looks right, up to the last step. The usual notation for SVD is that it returns three matrices A = USV*. S is a diagonal matrix (meaning all zero off the diagonal) that, in this case, basically gives a measure of how much each dimension captures of the original data. The numbers ("singular values") will go down, and you can look for a drop-off for how many dimensions are useful. Otherwise, you'll want to just choose an arbitrary number N for how many dimensions to take.
Here I get a little fuzzy. The coordinates of the terms (words) in the reduced-dimension space are in either U or V, I think depending on whether they are in the rows or columns of the input matrix. Offhand, I think the coordinates for the words will be the rows of U: the first row of U corresponds to the first row of the input matrix, i.e. the first word. Then you just take the first N columns of that row as the word's coordinates in the reduced space.
HTH
Update:
This process so far doesn't tell you exactly how to pick out tags. I've never heard of anyone using LSI to choose tags (a machine learning algorithm might be more suited to the task, like, say, decision trees). LSI tells you whether two words are similar. That's a long way from assigning tags.
There are two tasks: (a) what is the set of tags to use? and (b) how do you choose the best three tags? I don't have much of a sense of how LSI is going to help you answer (a). You can choose the set of tags by hand. But if you're using LSI, the tags probably should be words that occur in the documents. Then for (b), you want to pick out the tags that are closest to words found in the document. You could experiment with a few ways of implementing that. Choose the three tags that are closest to any word in the document, where closeness is measured by the cosine similarity (see Wikipedia) between the tag's coordinate (its row in U) and the word's coordinate (its row in U).
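Cosine similarity itself is only a few lines of PHP, given two equal-length coordinate vectors (e.g. two rows of U):

<?php
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(array $a, array $b): float
{
    $dot = $na = $nb = 0.0;
    foreach ($a as $i => $v) {
        $dot += $v * $b[$i];
        $na  += $v * $v;
        $nb  += $b[$i] * $b[$i];
    }
    return ($na && $nb) ? $dot / (sqrt($na) * sqrt($nb)) : 0.0;
}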
There is an additional SO thread on the perils of doing this all in PHP at link text.
Specifically, there is a link there to this paper on Latent Semantic Mapping, which describes how to get the resultant "topics" for a text.
