I'm looking to find an algorithm that I can implement in PHP to get the natural log() of an integer number using arbitrary precision maths. I'm limited by the PHP overlay library of the GMP library (see http://php.net/manual/en/ref.gmp.php for available GMP functions in PHP.)
If you know of a generic algorithm that can be translated into PHP, that would also be a useful starting point.
PHP supports a native log() function, I know, but I want to be able to work this out using arbitrary precision.
Closely related is getting an exp() function. If my schoolboy Maths serves me right, getting one can lead to the other.
Well, you have the Taylor series for ln, which can be rewritten for better convergence:
ln(y) = 2 * [ x + x^3/3 + x^5/5 + ... ], where x = (y-1)/(y+1)
To transform this nice equality into an algorithm, you have to understand how a converging series works: each term is smaller than the one before. This decrease happens fast enough that the total sum is a finite value: ln(y).
Because of nice properties of the real numbers, you may consider the following sequence of partial sums, which converges to ln(y):
L(1) = 2/1 * (y-1)/(y+1)
L(2) = 2/1 * (y-1)/(y+1) + 2/3 * ( (y-1)/(y+1) )^3
L(3) = 2/1 * (y-1)/(y+1) + 2/3 * ( (y-1)/(y+1) )^3 + 2/5 * ( (y-1)/(y+1) )^5
.. and so on.
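The algorithm to compute this sequence is straightforward:
x = (y-1)/(y+1);   // current odd power of (y-1)/(y+1)
z = x * x;         // ratio between consecutive odd powers
L = 0;
for (k = 1; x > epsilon; k += 2)   // assumes y > 1, so that x > 0
{
    L += 2 * x / k;   // add the term 2/k * x^k
    x *= z;           // advance to the next odd power
}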
At some point, your x will become so small that it will not contribute to the interesting digits of L anymore, instead only modifying the much smaller digits. When these modifications start to be too insignificant for your purposes, you may stop.
Thus if you want to achieve a precision of 1e-20, set epsilon to be reasonably smaller than that, and you're good to go.
Don't forget to factorize within the log if you can. If the argument is a perfect square, for example, ln(a²) = 2 ln(a).
Indeed, the series will converge faster when (y-1)/(y+1) is smaller, thus when y is smaller (or rather, closer to 1, but that should be equivalent if you're planning on using integers).
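The series above can be sketched with integer-only fixed-point arithmetic, which is the kind of arithmetic GMP offers. This is a minimal sketch in Python (whose integers are arbitrary precision), not a tested PHP port; the function name `ln_fixed`, the `digits` parameter, and the choice of 10 guard digits are assumptions made for illustration.

```python
def ln_fixed(y, digits):
    """Return ln(y) * 10**digits as an integer, for an integer y >= 1.

    Only integer operations are used, so each step maps onto
    big-integer add/multiply/divide calls in another language.
    """
    if y < 1:
        raise ValueError("y must be a positive integer")
    guard = 10                          # extra digits to absorb truncation error (assumed)
    scale = 10 ** (digits + guard)
    x = (y - 1) * scale // (y + 1)      # x = (y-1)/(y+1), scaled
    z = x * x // scale                  # x^2, scaled
    total = 0
    k = 1
    while x != 0:                       # stop once the power underflows the scale
        total += 2 * x // k             # add 2/k * x^k
        x = x * z // scale              # next odd power
        k += 2
    return total // 10 ** guard         # drop the guard digits


if __name__ == "__main__":
    value = ln_fixed(2, 30)
    print("ln(2) ~ 0.%030d" % value)
```

The loop runs until the scaled power of x truncates to zero, which plays the role of epsilon. For large y the ratio (y-1)/(y+1) is close to 1 and many terms are needed, which is why reducing the argument first (as suggested above) pays off.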
I want to get the values of pixels in a circumference in an image, and plot them.
I know the circumference is C=2*pi*radius, but I am uncertain about how to iterate through all the points on a circle to get the pixel data.
To get a single pixel, this would work. But I need to get pixel values along a circle's circumference. How should I iterate through to get that data?
$pixel=getPixel($image, $x, $y);
Look at the answers to the "3D sphere boundary" question.
It is the 3D equivalent of your problem, so it may help, but if you also need the pixels in the right order then it is most likely not for you.
If you want speed, use Bresenham's algorithm,
but for newbies it can be difficult to implement and even harder to understand.
If you want simplicity instead (or just to get started), then:
use the parametric circle equation
x=x0+r*cos(t)
y=y0+r*sin(t)
which gives you the pixel position on the circle boundary for t in <0,2*pi) [rad]; use degrees or radians according to your sin,cos functions
pixels only
the circle circumference is 2*pi*r [pixels], so the step for the parameter t should be small enough to reach that many points
dt <= 2*pi/(2*pi*r) // whole circle / number of pixels
dt <= 1/r // let's use half of that for safety
So for extracting the points use this C++ code:
int x0=...,y0=...,r=...; // input values: circle center and radius
int xx=x0+r+r,yy=y0,x,y; // last processed pixel, initialized to a point off the circle
double t,dt=0.5/r; // angle and angle step [rad]
for (t=0.0;t<2.0*M_PI;t+=dt)
{
x=x0+int(double(r)*cos(t));
y=y0+int(double(r)*sin(t));
if ((xx!=x)||(yy!=y)) // check if either coordinate crossed a pixel boundary
{
xx=x; yy=y;
// here do what you need to do with pixel x,y
}
}
If there are holes in your perimeter then lower dt further. The smaller it is, the smaller the step you use, but it also slows the whole thing down. You can store r as a double, or keep a double copy of it, to avoid the 2 int/double conversions. xx,yy are the last used pixel coordinates, kept to avoid processing a single pixel multiple times.
At the start they are set to a point that is not on the circle, for safety. If r==0 then you should set dt to some safe value like dt=M_PI;
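The same parametric walk can be sketched in Python as follows. It is a minimal sketch, not a port of the code above: it rounds to the nearest pixel instead of truncating, and the function name `circle_pixels` is an assumption made for illustration.

```python
import math


def circle_pixels(x0, y0, r):
    """Return the pixels on a circle, in order of increasing angle.

    Consecutive duplicates are skipped, so each pixel is visited once
    as the angle sweeps from 0 to 2*pi.
    """
    if r <= 0:
        return [(x0, y0)]               # degenerate circle: a single pixel
    dt = 0.5 / r                        # half of the 1/r bound, for safety
    steps = int(math.ceil(2.0 * math.pi / dt))
    points = []
    last = None
    for i in range(steps):
        t = i * dt
        p = (x0 + round(r * math.cos(t)), y0 + round(r * math.sin(t)))
        if p != last:                   # coordinates crossed a pixel boundary
            points.append(p)
            last = p
    if len(points) > 1 and points[-1] == points[0]:
        points.pop()                    # the sweep ended on the starting pixel
    return points


if __name__ == "__main__":
    for x, y in circle_pixels(50, 50, 10):
        print(x, y)                     # look up the pixel value at (x, y) here
```

Because the points come back in angular order, plotting the pixel values against their index gives the profile along the circumference.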
One way to do this would be to copy the low level code used to plot the pixels when creating the image of a circle on the screen. This works by incrementing (or decrementing) one of the co-ordinates and then adjusting the other one so as to keep the same distance from the centre of the circle. To ensure it is symmetrical, you make sure that each octant of the circle is plotted in exactly the same way. Details at http://www.asksatyam.com/2011/01/bresenhams-circle-algorithm_22.html (Or http://en.wikipedia.org/wiki/Midpoint_circle_algorithm of course).
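A minimal sketch of the midpoint circle algorithm described above, in Python. It follows the common integer-only formulation with a decision variable; the function name `midpoint_circle` is an assumption. Note that it returns an unordered set, since the eight octants are generated together, so the points would need sorting by angle if the order along the circumference matters.

```python
def midpoint_circle(x0, y0, r):
    """Return the set of pixels on a circle using integer arithmetic only."""
    points = set()
    x, y = r, 0
    decision = 1 - r                    # tracks whether the midpoint is inside the circle
    while x >= y:
        # Mirror the point of the first octant into all eight octants.
        for dx, dy in ((x, y), (y, x), (-y, x), (-x, y),
                       (-x, -y), (-y, -x), (y, -x), (x, -y)):
            points.add((x0 + dx, y0 + dy))
        y += 1                          # always step in y
        if decision <= 0:
            decision += 2 * y + 1       # stay on the same x
        else:
            x -= 1                      # move one pixel towards the centre
            decision += 2 * (y - x) + 1
    return points


if __name__ == "__main__":
    print(sorted(midpoint_circle(0, 0, 3)))
```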
I believe I need a solution using PHP for the following problem. Let's say we have a map whose width is 100000 and whose height is 100000.
I have a region in that map, defined by many X / Y / Z coordinates, something like:
{{-56000;190073;-4509};{-54955;190073;-4509};{-54954;190638;-4509};{-56000;190638;-4509}}
That's 4 points forming a square on our map. But the zones can be defined by 10+ points, so nothing like squares.
Now I'd need a way to generate N different random coordinates that are INSIDE that region.
I don't know where and how to start with this problem, but I know how to use PHP. Just actually lacking the theory part. What algorithm could I use?
Use the rand function to generate x & y coordinates in the range specified by your bounds:
$x = rand($min_x, $max_x);
$y = rand($min_y, $max_y);
I'm not sure what range you want to use for your z coordinate.
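Picking x and y inside the bounding box only works as-is for an axis-aligned rectangle. For a zone defined by 10+ points, one common approach is rejection sampling: draw a point in the bounding box and keep it only if it passes a point-in-polygon test. The sketch below uses the ray-casting test; it is written in Python rather than PHP, ignores the z coordinate, and the function names and the `max_attempts` guard are assumptions made for illustration.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray casting: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > y) != (yj > y):        # edge straddles the horizontal line through y
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def random_points_in_polygon(polygon, n, max_attempts=100000):
    """Return up to n distinct integer points inside the polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = set()
    attempts = 0
    while len(points) < n and attempts < max_attempts:
        attempts += 1
        x = random.randint(min(xs), max(xs))
        y = random.randint(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.add((x, y))          # the set keeps the points distinct
    return list(points)


if __name__ == "__main__":
    zone = [(-56000, 190073), (-54955, 190073), (-54954, 190638), (-56000, 190638)]
    print(random_points_in_polygon(zone, 5))
```

Rejection sampling wastes draws when the polygon covers only a small part of its bounding box; in that case triangulating the polygon and sampling per triangle would be the usual alternative.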
My primary question is:
Is this a lot of loops?
while ($decimals < 50000 and $remainder != "0") {
$number = floor($remainder/$currentdivider); // always round down: 10/3 = 3, 10/7 = 1
$remainder = $remainder%$currentdivider; // 10%3 = 1, 10%7 = 3
$thisnumber = $thisnumber . $number; // append the next digit
$remainder = $remainder . 0; // append a zero, i.e. multiply by 10: 1 becomes 10
$decimals += 1;
}
Or could I fit more into it, without the server crashing/lagging?
I'm just wondering.
Also, is there a more efficient way of doing the above (e.g. finding out that 1/3 = 0.333... to 50,000 decimals)?
Finally:
I'm doing this for a pi formula, the (1 - 1/3 + 1/5 - 1/7 etc.) one,
and I'm wondering if there is a better one (in PHP).
I have found one that finds pi to 2000 in 4 seconds.
But that's not what I want. I want an infinite series that converges closer to pi,
so on every refresh, users can view it getting closer...
But obviously, converging using the above formula takes a LONG time.
Are there any other 'loop'-like pi formulas (workable in PHP) that converge faster?
Thanks a lot...
Here you have several formulas for calculating Pi:
http://mathworld.wolfram.com/PiFormulas.html
All of them are "workable" in PHP, like in any other programming language. A different question is how fast they are or how difficult they are to implement.
Whether the formulas converge faster or slower is a math question, not a programming one, so I can't help you there. I can tell you that, as a rule of thumb, the fewer nested loops you use, the faster your algorithm will be (this is a general rule, don't take it as the absolute truth!)
Anyway, since the digits of Pi are known until a certain digit, why don't you copy it into a file and then just index it? That will be extremely fast :)
You can check previous answers to similar questions:
How can pi be calculated to a set number of digits in PHP?
https://stackoverflow.com/questions/3045020/which-is-the-best-formulae-to-find-pi
Check http://mathworld.wolfram.com/PiIterations.html (taken from the last answer). Those formulas use iterations and can therefore be implemented using a loop.
You should use Google and search for "php implementation xxxxxxx" (where xxxxxx stands for the name of the algorithm you want to search for).
EDIT: Here is an implementation of Viète's formula using a while-loop in PHP.
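A minimal sketch of Viète's formula as a loop, written in Python rather than PHP and not the answerer's original code. It uses ordinary floating point, so it tops out at about 15 significant digits; the function name `viete_pi` is an assumption.

```python
import math


def viete_pi(iterations):
    """Approximate pi with Viete's product: 2/pi = (a1/2) * (a2/2) * ...

    where a1 = sqrt(2) and a(k+1) = sqrt(2 + a(k)).
    """
    a = 0.0
    product = 1.0
    for _ in range(iterations):
        a = math.sqrt(2.0 + a)          # next nested square root
        product *= a / 2.0
    return 2.0 / product


if __name__ == "__main__":
    for n in (1, 5, 10, 25):
        print(n, viete_pi(n))           # each line is closer to pi than the last
```

Each iteration shrinks the error by roughly a factor of four, so the approximation visibly improves from one refresh to the next, far faster than the 1 - 1/3 + 1/5 - ... series.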
I would like to implement Latent Semantic Analysis (LSA) in PHP in order to find out topics/tags for texts.
Here is what I think I have to do. Is this correct? How can I code it in PHP? How do I determine which words to choose?
I don't want to use any external libraries. I already have an implementation of the Singular Value Decomposition (SVD).
1. Extract all words from the given text.
2. Weight the words/phrases, e.g. with tf–idf. If weighting is too complex, just take the number of occurrences.
3. Build up a matrix: the columns are some documents from the database (the more the better?), the rows are all unique words, the values are the numbers of occurrences or the weights.
4. Do the Singular Value Decomposition (SVD).
5. Use the values in the matrix S (SVD) to do the dimension reduction (how?).
I hope you can help me. Thank you very much in advance!
LSA links:
Landauer (co-creator) article on LSA
the R-project lsa user guide
Here is the complete algorithm. If you have SVD, you are most of the way there. The papers above explain it better than I do.
Assumptions:
your SVD function will give the singular values and singular vectors in descending order. If not, you have to do more acrobatics.
M: corpus matrix, w (words) by d (documents) (w rows, d columns). These can be raw counts, or tfidf or whatever. Stopwords may or may not be eliminated, and stemming may happen (Landauer says keep stopwords and don't stem, but yes to tfidf).
U,Sigma,V = singular_value_decomposition(M)
U: w x w
Sigma: min(w,d) length vector, or w * d matrix with diagonal filled in the first min(w,d) spots with the singular values
V: d x d matrix
Thus U * Sigma * V = M
# you might have to do some transposes depending on how your SVD code
# returns U and V. verify this so that you don't go crazy :)
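The corpus matrix M described in the assumptions can be sketched as follows, using raw counts and whitespace tokenization. This is a minimal sketch in Python, not PHP; the function name `term_document_matrix` and the lower-casing step are assumptions, and tf-idf weighting, stopword handling and stemming are left out.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix): one row per word, one column per document."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set(word for c in counts for word in c))
    # matrix[i][j] = number of times vocabulary[i] occurs in documents[j]
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix


if __name__ == "__main__":
    vocab, m = term_document_matrix(["the yankees won", "the election was won"])
    for word, row in zip(vocab, m):
        print(word, row)
```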
Then the dimensionality reduction. The actual LSA paper suggests that a good approximation for the basis is to keep enough vectors such that their singular values add up to more than 50% of the total of the singular values.
More succinctly... (pseudocode)
Let s1 = sum(Sigma).
total = 0
for ii in range(len(Sigma)):
    val = Sigma[ii]
    total += val
    if total > .5 * s1:
        return ii + 1   # number of singular values kept (ii is a 0-based index)
This will return the rank k of the new basis, which was min(d,w) before, and which we'll now approximate with k.
(here, ' -> prime, not transpose)
We create new matrices: U', Sigma', V', with sizes w x k, k x k, and k x d.
That's the essence of the LSA algorithm.
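The rank selection and truncation can be sketched as follows. This is a minimal sketch in plain Python lists, assuming the SVD routine returns the singular values as a descending vector and V already transposed (called `vt` here, with one row per singular vector); all function names are assumptions made for illustration.

```python
def choose_rank(sigma, fraction=0.5):
    """Smallest number of singular values whose sum exceeds fraction * total."""
    total = sum(sigma)
    running = 0.0
    for index, value in enumerate(sigma):
        running += value
        if running > fraction * total:
            return index + 1            # index is 0-based, the rank is a count
    return len(sigma)


def truncate(u, sigma, vt, k):
    """Keep the first k columns of U, k singular values, and k rows of V^T."""
    return [row[:k] for row in u], sigma[:k], [row[:] for row in vt[:k]]


def reconstruct(u_k, sigma_k, vt_k):
    """Multiply U' * Sigma' * V' back into a w x d matrix."""
    cols = len(vt_k[0]) if vt_k else 0
    return [[sum(u_row[r] * sigma_k[r] * vt_k[r][j] for r in range(len(sigma_k)))
             for j in range(cols)]
            for u_row in u_k]


if __name__ == "__main__":
    u = [[1, 0], [0, 1]]
    sigma = [3, 1]
    vt = [[1, 0], [0, 1]]
    k = choose_rank(sigma)
    print(k, reconstruct(*truncate(u, sigma, vt, k)))
```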
This resultant matrix U' * Sigma' * V' can be used for 'improved' cosine similarity searching, or you can pick the top 3 words for each document in it, for example. Whether this yields more than a simple tf-idf is a matter of some debate.
To me, LSA performs poorly on real-world data sets because of polysemy, and on data sets with too many topics. Its mathematical / probabilistic basis is unsound (it assumes normal-ish (Gaussian) distributions, which don't make sense for word counts).
Your mileage will definitely vary.
Tagging using LSA (one method!)
Construct the U' Sigma' V' dimensionally reduced matrices using SVD and a reduction heuristic
By hand, look over the U' matrix, and come up with terms that describe each "topic". For example, if the biggest parts of that vector were "Bronx, Yankees, Manhattan," then "New York City" might be a good term for it. Keep these in an associative array, or list. This step should be reasonable since the number of vectors will be finite.
Assuming you have a vector (v1) of words for a document, as a column vector of length w, then t(U') * v1 will give the strength of each 'topic' for that document. Select the 3 highest, then give their "topics" as computed in the previous step.
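The last step can be sketched as a projection of the document's word vector onto the columns of U'. This is a minimal sketch in Python; the function name `top_topics`, the hand-made `labels` list, and ranking by absolute score (since the sign of a singular vector is arbitrary) are assumptions made for illustration.

```python
def top_topics(doc_vector, u_k, labels, n=3):
    """Return the labels of the n topics with the strongest projection.

    doc_vector: word counts or weights, length w
    u_k:        truncated U', a w x k list of rows
    labels:     k hand-written topic names, one per column of U'
    """
    k = len(labels)
    scores = [sum(doc_vector[w] * u_k[w][t] for w in range(len(doc_vector)))
              for t in range(k)]
    ranked = sorted(range(k), key=lambda t: abs(scores[t]), reverse=True)
    return [labels[t] for t in ranked[:n]]


if __name__ == "__main__":
    u_k = [[1, 0], [1, 0], [0, 1], [0, 1]]      # toy U' with 4 words and 2 topics
    labels = ["New York City", "Politics"]      # hypothetical hand-made labels
    print(top_topics([2, 1, 0, 1], u_k, labels))
```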
This answer isn't directly to the poster's question, but to the meta question of how to autotag news items. The OP mentions Named Entity Recognition, but I believe they mean something more along the lines of autotagging. If they really mean NER, then this response is hogwash :)
Given these constraints (600 items / day, 100-200 characters / item) with divergent sources, here are some tagging options:
By hand. An analyst could easily do 600 of these per day, probably in a couple of hours. Something like Amazon's Mechanical Turk, or making users do it, might also be feasible. Having some number of "hand-tagged", even if it's only 50 or 100, will be a good basis for comparing whatever the autogenerated methods below get you.
Dimensionality reductions, using LSA, Topic-Models (Latent Dirichlet Allocation), and the like.... I've had really poor luck with LSA on real-world data sets and I'm unsatisfied with its statistical basis. LDA I find much better, and it has an incredible mailing list that has the best thinking on how to assign topics to texts.
Simple heuristics... if you have actual news items, then exploit the structure of the news item. Focus on the first sentence, toss out all the common words (stop words) and select the best 3 nouns from the first two sentences. Or heck, take all the nouns in the first sentence, and see where that gets you. If the texts are all in English, then do part-of-speech analysis on the whole shebang, and see what that gets you. With structured items, like news reports, LSA and other order-independent methods (tf-idf) throw out a lot of information.
Good luck!
(if you like this answer, maybe retag the question to fit it)
That all looks right, up to the last step. The usual notation for SVD is that it returns three matrices A = USV*. S is a diagonal matrix (meaning all zero off the diagonal) that, in this case, basically gives a measure of how much each dimension captures of the original data. The numbers ("singular values") will go down, and you can look for a drop-off for how many dimensions are useful. Otherwise, you'll want to just choose an arbitrary number N for how many dimensions to take.
Here I get a little fuzzy. The coordinates of the terms (words) in the reduced-dimension space are in either U or V, I think depending on whether they are in the rows or columns of the input matrix. Offhand, I think the coordinates for the words will be the rows of U, i.e. the first row of U corresponds to the first row of the input matrix, i.e. the first word. Then you just take the first N columns of that row as the word's coordinates in the reduced space.
HTH
Update:
This process so far doesn't tell you exactly how to pick out tags. I've never heard of anyone using LSI to choose tags (a machine learning algorithm might be more suited to the task, like, say, decision trees). LSI tells you whether two words are similar. That's a long way from assigning tags.
There are two tasks: (a) what is the set of tags to use? and (b) how to choose the best three tags? I don't have much of a sense of how LSI is going to help you answer (a). You can choose the set of tags by hand. But, if you're using LSI, the tags probably should be words that occur in the documents. Then for (b), you want to pick out the tags that are closest to words found in the document. You could experiment with a few ways of implementing that. For example, choose the three tags that are closest to any word in the document, where closeness is measured by the cosine similarity (see Wikipedia) between the tag's coordinate (its row in U) and the word's coordinate (its row in U).
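One way to implement (b) can be sketched as follows: score each candidate tag by its best cosine similarity to any word of the document, then keep the top three. This is a minimal sketch in Python; the function names and the `coords` dictionary (mapping each word to its truncated row of U) are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                      # treat a zero vector as unrelated
    return dot / (norm_a * norm_b)


def closest_tags(doc_words, tags, coords, n=3):
    """Return the n tags whose coordinates are closest to any document word."""
    best = {}
    for tag in tags:
        if tag not in coords:
            continue                    # a tag must be a word seen in the corpus
        sims = [cosine_similarity(coords[tag], coords[w])
                for w in doc_words if w in coords]
        if sims:
            best[tag] = max(sims)
    return sorted(best, key=best.get, reverse=True)[:n]


if __name__ == "__main__":
    coords = {"yankees": [1.0, 0.1], "bronx": [0.9, 0.2],
              "election": [0.0, 1.0], "vote": [0.1, 0.9]}
    print(closest_tags(["bronx"], ["yankees", "election"], coords, n=1))
```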
There is an additional SO thread on the perils of doing this all in PHP.
Specifically, there is a link there to this paper on Latent Semantic Mapping, which describes how to get the resultant "topics" for a text.