Word occurrence count in PHP - scaling and optimization

You have a large collection of very short (<100 characters) to medium-long (~5k characters) text items in a MySQL table. You want a solid and fast algorithm to obtain an aggregated word occurrence count across all the items.
Items selected can be as few as one (or none) and as many as 1M+, so the size of the text that has to be analyzed can vary wildly.
The current solution involves:
reading text from the selected records into a text variable
preg_replace-ing out everything that is not a "word", leaving a somewhat cleaned-up, shorter text (hashtags, @mentions, http(s):// links, numeric sequences such as phone numbers, etc. are not "words" and are stripped out)
exploding what's left into a "buffer" array
taking out everything that's shorter than two characters and dumping everything else into a "master" array
the resulting array is then transformed with array_count_values, sorted with arsort, spliced a first time to get a more manageable array size, then checked against stopword lists in several languages, processed and spliced some more, and finally output in JSON form as the list of the 50 most frequent words.
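For reference, a minimal sketch of that pipeline; the connection details, table/column names, regular expressions and stopword list below are placeholders, not the asker's actual code:

<?php
// Hypothetical sketch only: connection details, table/column names, the regexes
// and the stopword list are placeholders, not the asker's actual code.
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass');
$stopwords = ['the' => true, 'and' => true, 'for' => true];   // real lists exist per language

$master = [];
foreach ($pdo->query('SELECT body FROM items') as $row) {
    // Strip hashtags, @mentions, links and digit runs, then anything non-letter
    $clean = preg_replace('~[#@]\w+|https?://\S+|\d+~u', ' ', $row['body']);
    $clean = preg_replace('~[^\p{L}\s]+~u', ' ', $clean);

    foreach (preg_split('~\s+~u', mb_strtolower($clean), -1, PREG_SPLIT_NO_EMPTY) as $w) {
        if (mb_strlen($w) >= 2) {
            $master[] = $w;
        }
    }
}

$counts = array_count_values($master);                  // the reported bottleneck
arsort($counts);
$counts = array_slice($counts, 0, 500, true);           // first splice to a manageable size
$counts = array_diff_key($counts, $stopwords);          // stopword removal
echo json_encode(array_slice($counts, 0, 50, true));    // top 50 as JSON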
Tracing the timing of the various steps shows that the bottleneck is at first in the query, but as the item count increases it rapidly moves to the array_count_values call (everything after that is more or less immediate).
On a ~10k-item test run the total execution time is ~3s from beginning to end; with 20k items it takes ~7s.
A (rather extreme but not impossible) case with 1.3M items takes 1 minute for the MySQL query, and the script can then parse roughly ~75k items per minute (so the estimate is around 17 minutes).
The result should be visualized as the response to an AJAX call, so with this kind of timing the UX is evidently disrupted. I'm looking for ways to optimize everything as much as possible. A 30s load time may be acceptable (however unrealistic), 10 minutes (or more) is not.
I've tried batch processing by array_count_values-ing the chunks and then adding the resulting count arrays to a master array by key, but that helps only so much - the sum of the parts is equal to (or slightly larger than) the total in terms of timing.
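That chunked variant amounts to something like this (continuing from the $master array in the sketch above, with hypothetical names):

<?php
// Hypothetical sketch of the chunked counting described above: count each chunk
// separately, then merge the partial counts into a running total by key.
$totals = [];
foreach (array_chunk($master, 100000) as $chunk) {
    foreach (array_count_values($chunk) as $word => $n) {
        $totals[$word] = ($totals[$word] ?? 0) + $n;
    }
}
arsort($totals);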
I only need the 50 most frequent occurrences at the top of the list, so there's possibly some room for improvement by cutting a few corners.

Dump the column(s) into a text file. Suggestion: SELECT ... INTO OUTFILE 'x.txt' ...
Then (if on Linux etc.): tr -s '[:blank:]' '\n' <x.txt | sort | uniq -c
To get the top 50:
tr -s '[:blank:]' '\n' <x.txt | sort | uniq -c | sort -nbr | head -50
If you need to tweak the definition of "word", such as dealing with punctuation, see the documentation on tr. You may have trouble with contractions versus phrases in single-quotes.
Some of the cleansing of the text can (and should) be done in SQL:
WHERE col NOT LIKE 'http%'
Some can (and should) be done in the shell script.
For example, this will get rid of #variables and phone numbers and 0-, 1-, and 2-character "words":
egrep -v '^#|^([-0-9]+$|$|.$|..)$'
That can be in the pipe stream just before the first sort.
The only limitation is disk space. Each step in the script is quite fast.
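If this has to be wired into the PHP/AJAX flow, one hedged way to glue it together might be the following; the paths, file name and exact pipeline are assumptions based on the commands above:

<?php
// Hypothetical glue code: run the shell pipeline from PHP and return the top 50
// as JSON. Assumes the text was already exported to /tmp/x.txt via INTO OUTFILE.
$pattern  = '^#|^([-0-9]+$|$|.$|..)$';
$pipeline = "tr -s '[:blank:]' '\\n' < /tmp/x.txt"
          . ' | egrep -v ' . escapeshellarg($pattern)
          . ' | sort | uniq -c | sort -nbr | head -50';

$top = [];
foreach (array_filter(explode("\n", (string) shell_exec($pipeline))) as $line) {
    [$count, $word] = preg_split('~\s+~', trim($line), 2);
    $top[$word] = (int) $count;
}

header('Content-Type: application/json');
echo json_encode($top);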
Test case:
The table: 176K rows of mostly text (including some linefeeds)
$ wc x3.txt
3,428,398 lines 31,925,449 'words' 225,339,960 bytes
$ time tr -s '[:blank:]' '\n' <x.txt | egrep -v '^#|^([-0-9]+$|$|.$|..)$' |
sort | uniq -c | sort -nbr | head -50
658569 the
306135 and
194778 live
175529 rel="nofollow"
161684 you
156377 for
126378 that
121560 this
119729 with
...
real 2m16.926s
user 2m23.888s
sys 0m1.380s
Fast enough?
I was watching it via top. It seems that the slowest part is the first sort. The SELECT took under 2 seconds (however, the table was probably cached in RAM).

Related

PHP processing time and CPU usage when comparing millions of rows

I have a tool which compares one string with, on average, 250k strings from a database.
Two tables are used during the compare process: categories and categories_strings. The string table has around 2.5 million rows, while the pivot table, categories_strings, contains 7 million rows.
My query is pretty simple: it selects the string columns, joins the pivot table, adds a WHERE clause to specify the category, and sets a limit of 10,000.
I run this query in a loop; every batch is 10,000 strings. To execute the whole script faster I use the seek method instead of MySQL's OFFSET, which was way too slow at huge offsets.
Then comparison with common algorithms such as plain text matching, levenshtein, etc. is performed on each batch. This part is simple.
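A hedged sketch of the seek-method batching plus per-batch comparison (table and column names are guesses, not the real schema):

<?php
// Hypothetical sketch: paginate with "WHERE id > :last" (seek method) instead of
// OFFSET, and run the comparison on each 10k batch. Schema names are assumptions.
$pdo      = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass');
$needle   = 'some string to compare';
$category = 42;
$lastId   = 0;
$distances = [];

do {
    $stmt = $pdo->prepare(
        'SELECT s.id, s.value
           FROM strings s
           JOIN categories_strings cs ON cs.string_id = s.id
          WHERE cs.category_id = :cat AND s.id > :last
          ORDER BY s.id
          LIMIT 10000'
    );
    $stmt->execute(['cat' => $category, 'last' => $lastId]);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        // Note: PHP's levenshtein() only accepts strings up to 255 bytes.
        $distances[$row['id']] = levenshtein($needle, $row['value']);
        $lastId = $row['id'];              // seek: remember the last id we saw
    }
} while (count($rows) === 10000);

asort($distances);                          // closest matches first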
The question starts here.
On my laptop (Lenovo X230) the whole process for, say, 250k strings compared takes 7.4 seconds to load the SQL and 13.3 seconds to compare all rows, and then 0.1 seconds for sorting and transforming for the view.
I also have a small dedicated server. Same PHP version, same MySQL. The web server doesn't matter, as I run it from the command line right now. While it takes roughly 20 seconds in total on my laptop, on the server it takes... 120 seconds.
So, what is the most important factor affecting execution time for a long-running PHP program? All I can think of is the CPU, which on the dedicated server is worse: an Intel(R) Atom(TM) CPU N2800 @ 1.86GHz. Memory consumption is pretty low, about 2-4%. CPU usage, however, is around 60% on my laptop and 99.7-100% on the server.
Is the CPU the most important factor in this case? Is there any way to split the work, for example into several processes that would take less time in total? And how can I monitor CPU usage to see which part of the script consumes the most?

Add margin to X-axis on RRDtool graph

A customer of mine wants better insight into the dBm values that an optical SFP sends and receives. Every 5 minutes I poll these values and update an RRD file. The RRD graph I create with the RRD file as its source is generated in the following way:
/usr/bin/rrdtool graph /var/www/customer/tmp/ZtIKQOJZFf.png --alt-autoscale
--rigid --start now-3600 --end now --width 800 --height 350
-c BACK#EEEEEE00 -c SHADEA#EEEEEE00 -c SHADEB#EEEEEE00 -c FONT#000000
-c GRID#a5a5a5 -c MGRID#FF9999 -c FRAME#5e5e5e -c ARROW#5e5e5e -R normal
--font LEGEND:8:'DejaVuSansMono' --font AXIS:7:'DejaVuSansMono' --font-render-mode normal
-E COMMENT:'Bits/s Last Avg Max \n'
DEF:sfptxpower=/var/www/customer/rrd/sfpdbm.rrd:SFPTXPOWER:AVERAGE
DEF:sfprxpower=/var/www/customer/rrd/sfpdbm.rrd:SFPRXPOWER:AVERAGE
DEF:sfptxpower_max=/var/www/customer/rrd/sfpdbm.rrd:SFPTXPOWER:MAX
DEF:sfprxpower_max=/var/www/customer/rrd/sfpdbm.rrd:SFPRXPOWER:MAX
LINE1.25:sfptxpower#000099:'tx ' GPRINT:sfptxpower:LAST:%6.2lf%s\g
GPRINT:sfptxpower:AVERAGE:%6.2lf%s\g GPRINT:sfptxpower_max:MAX:%6.2lf%s\g
COMMENT:'\n' LINE1.25:sfprxpower#B80000:'rx '
GPRINT:sfprxpower:LAST:%6.2lf%s\g GPRINT:sfprxpower:AVERAGE:%6.2lf%s\g
GPRINT:sfprxpower_max:MAX:%6.2lf%s\g COMMENT:'\n'
which draws a graph just how it is supposed to be. However, the graph that comes out of it is not very readable, as both the tx and rx lines sit right on the border of the graph.
My question therefore is: Is it possible to add some sort of margin (like a percentage) to the X-axis so that both lines can be easily seen on the graph?
RRDTool graph has four different scaling modes you can select via options: autoscale (the default), alt-autoscale, specified-expandable, and specified-rigid.
Autoscale - this scales the graph to fit the data using the default algorithm; you get it by omitting the other scaling options. It will try to keep the Y-axis range within common ranges -- in your case, probably 0 to -5. Sometimes it works well, sometimes it doesn't.
Alt-Autoscale - this is like autoscale, but clings closely to the actual data max and min. You choose this with --alt-autoscale and it is what you are currently using.
Specified, expandable - This lets you specify a max/min for the Y axis, but they are expanded out if the data are outside this range. You choose this by specifying --upper-limit and/or --lower-limit but NOT --rigid. In your case, if you give an upper limit of -2 and a lower limit of -4 it would look good, and the graph range would be expanded if your data go to -5.
Specified, rigid - This is like above, but the limits are fixed where you specify them. If the data go outside this range, then the line is not displayed. You specify this by using --rigid when giving an upper or lower bound.
Note that with the Specified types, you can specify only one end of the range so as to get a specified type at one end and continue to use an autoscale type for the other.
Based on this, I would suggest removing the --rigid and --alt-autoscale options and instead specifying --upper-limit -2 and --lower-limit -4 to display your data more neatly. If the data leave this range, the axis will expand as it does now; whether this works well depends on the nature of the data and how much they normally vary.

Detect points of change in sensor data

I have lots of sensor data from which I need to be able to detect changes reliably. Basically it comes from a water level sensor in a remote client, which uses an accelerometer and a float to get the water level. My problem is that the data can be noisy at times (it varies by 2-5 units per measurement) and sometimes I need to detect changes as small as 7-9 units.
When I'm graphing the data it's quite obvious to the human eye that there's a change, but how would I go about it programming-wise? Right now I'm just trying to detect changes bigger than x programmatically, but it's not very reliable. I've attached a sample graph and pointed out the changes with arrows. The huge changes at the beginning are just testing, so they're not normal behaviour for the data.
The data is in a MySQL database and the code is in PHP, so if you could point me in the right direction I'd highly appreciate it!
EDIT: There can also be some spikes in the data that are not valid readings but rather errors in the data.
EDIT: Example data can be found from http://pastebin.com/x8C9AtAk
The algorithm would need to run every 30 minutes or so and should be able to detect changes within the last 2-4 pings. Pings arrive at 3-5 minute intervals.
I made some awk that you, or someone else, might like to experiment with. It averages the last 10 samples (m) excluding the current one, and also averages the last 2 samples (n), then calculates the difference between the two and outputs a message if the absolute difference exceeds a threshold.
#!/bin/bash
awk -F, '
# j will count number of samples
# we will average last m samples and last n samples
BEGIN {j=0;m=10;n=2}
{d[j]=$3;id[j++]=$1" "$2}                     # Store this point in array d[]
END {                                         # Do this at end after reading all samples
    for(i=m-1;i<j;i++){                       # Iterate over all samples, except first few while building average
        totlastm=0                            # Calculate average over last m not incl current
        for(k=m;k>0;k--)totlastm+=d[i-k]
        avelastm=totlastm/m                   # Average = total/m
        totlastn=0                            # Calculate average over last n
        for(k=n-1;k>=0;k--)totlastn+=d[i-k]
        avelastn=totlastn/n                   # Average = total/n
        dif=avelastm-avelastn                 # Calculate difference between ave last m and ave last n
        if(dif<0)dif=-dif                     # Make absolute
        mesg="";
        if(dif>4)mesg="<-Change detected";    # Make message if change large
        printf "%s: Sample[%d]=%d,ave(%d)=%.2f,ave(%d)=%.2f,dif=%.2f%s\n",id[i],i,d[i],m,avelastm,n,avelastn,dif,mesg;
    }
}
' <(tr -d '"' < levels.txt)
The last bit <(tr...) just removes the double quotes before sending the file levels.txt to awk.
Here is an excerpt from the output:
18393344 2014-03-01 14:08:34: Sample[1319]=343,ave(10)=342.00,ave(2)=342.00,dif=0.00
18393576 2014-03-01 14:13:37: Sample[1320]=343,ave(10)=342.10,ave(2)=343.00,dif=0.90
18393808 2014-03-01 14:18:39: Sample[1321]=343,ave(10)=342.10,ave(2)=343.00,dif=0.90
18394036 2014-03-01 14:23:45: Sample[1322]=342,ave(10)=342.30,ave(2)=342.50,dif=0.20
18394266 2014-03-01 14:28:47: Sample[1323]=341,ave(10)=342.20,ave(2)=341.50,dif=0.70
18394683 2014-03-01 14:38:16: Sample[1324]=346,ave(10)=342.20,ave(2)=343.50,dif=1.30
18394923 2014-03-01 14:43:17: Sample[1325]=348,ave(10)=342.70,ave(2)=347.00,dif=4.30<-Change detected
18395167 2014-03-01 14:48:25: Sample[1326]=345,ave(10)=343.20,ave(2)=346.50,dif=3.30
18395409 2014-03-01 14:53:28: Sample[1327]=347,ave(10)=343.60,ave(2)=346.00,dif=2.40
18395645 2014-03-01 14:58:30: Sample[1328]=347,ave(10)=343.90,ave(2)=347.00,dif=3.10
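Since the question's code is in PHP, a rough PHP equivalent of the same moving-average comparison might look like this (the window sizes and the threshold of 4 mirror the awk; the input array is an assumption):

<?php
// Hypothetical PHP version of the awk above: compare the average of the last $m
// samples (excluding the current window) with the average of the last $n samples
// and flag a change when the absolute difference exceeds $threshold.
function detectChanges(array $levels, int $m = 10, int $n = 2, float $threshold = 4.0): array
{
    $changes = [];
    for ($i = $m, $count = count($levels); $i < $count; $i++) {
        $aveM = array_sum(array_slice($levels, $i - $m, $m)) / $m;       // last m, excluding current
        $aveN = array_sum(array_slice($levels, $i - $n + 1, $n)) / $n;   // last n, including current
        if (abs($aveM - $aveN) > $threshold) {
            $changes[] = $i;   // index where a change was detected
        }
    }
    return $changes;
}

// $levels would come from the MySQL table, ordered by timestamp.
$levels = [343, 343, 343, 342, 341, 346, 348, 345, 347, 347, 347, 352, 355];
print_r(detectChanges($levels));   // indices 11 and 12 for this sample input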
The right way to go about problems of this kind is to build a model of the phenomenon of interest and also a model of the noise process, and then make inferences about the phenomenon given some data. These inferences are necessarily probabilistic. The general computation you need to carry out is P(H_k | data) = P(data | H_k) P(H_k) / sum_k P(data | H_k) P(H_k) (a generalized form of Bayes' rule), where the H_k are all the hypotheses of interest, such as "a step of some magnitude at some time" or "noise of some magnitude". In this case there might be a large number of plausible hypotheses, covering all possible magnitudes and times. You might need to limit the range of hypotheses considered in order to make the problem tractable, e.g. only looking back a certain number of time steps.

PHP Massive Memory Usage (30+ GB) Using Associative Arrays

I'm building a script which requires counting the number of occurrences of each word in each file, out of about 2000 files, each around 500KB.
So that is 1GB of data, but memory usage goes over 30+ GB (then it runs out and the script dies).
I've tracked down the cause to my liberal use of associative arrays, which looks like this:
for($runc=0; $runc<$numwords; $runc++)
{
    $word=trim($content[$runc]);
    if ($words[$run][$word]==$wordacceptance && !$wordused[$word])
    {
        $wordlist[$onword]=$word;
        $onword++;
        $wordused[$word]=true;
    }
    $words[$run][$word]++; // +1 to number of occurrences of this word in current category
    $nwords[$run]++;
}
$run is the current category.
You can see that to count the words I'm just adding them to the associative array $words[$run][$word], which increases with each occurrence of each word in each category of files.
Then $wordused[$word] is used to make sure that a word doesn't get added twice to the wordlist.
$wordlist is a simple array (0,1,2,3,etc.) with a list of all different words used.
This eats up gigantic amounts of memory. Is there a more efficient way of doing this? I was considering using a MySQL memory table, but I want to do the whole thing in PHP so it's fast and portable.
Have you tried the builtin function for counting words?
http://hu2.php.net/manual/en/function.str-word-count.php
EDIT: Or use explode to get an array of words, trim them all with array_walk, then sort, then go through with a for loop and count the occurrences; when a new word appears you can flush the count of the previous one, so there's no need to keep track of which words you've already seen.
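A hedged sketch of that suggestion (the whitespace split is an assumption; str_word_count($text, 1) followed by array_count_values would be the shortest built-in route):

<?php
// Hypothetical sketch of the sort-then-count idea above: sort the words, then
// walk the sorted list and flush each run when a new word starts, so only the
// current run needs to be tracked.
function countWords(string $text): array
{
    $words = preg_split('~\s+~', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    sort($words);

    $counts  = [];
    $current = null;
    $run     = 0;
    foreach ($words as $w) {
        if ($w === $current) {
            $run++;
            continue;
        }
        if ($current !== null) {
            $counts[$current] = $run;   // flush the finished run
        }
        $current = $w;
        $run     = 1;
    }
    if ($current !== null) {
        $counts[$current] = $run;       // flush the last run
    }
    return $counts;
}

// Process the ~2000 files one at a time instead of holding everything in memory.
print_r(countWords('the quick brown fox jumps over the lazy dog the'));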

php - how do I display 5 results from a possible 50 randomly but ensure all results are displayed an equal amount

In PHP, how do I display 5 results from a possible 50 randomly, but ensure all results are displayed an equal number of times?
For example, the table has 50 entries.
I wish to show 5 of these randomly with every page load, but I also need to ensure all results are rotated through an equal number of times.
I've spent hours googling for this but can't work it out - would very much like your help please.
Please scroll down to "biased randomness" if you don't want to read all of this.
In MySQL you can just use SELECT * FROM table ORDER BY RAND() LIMIT 5.
What you want just does not work; it is logically contradictory.
You have to understand that complete randomness, by definition, only means an equal distribution after an infinite period of time.
The longer the selection interval, the more even the distribution.
If you MUST have an even distribution of selections over, for example, every 24-hour interval, you cannot use a random algorithm. That is contradictory by definition.
It really depends on what your goal is.
You could, for example, pick some element at random and then lower the probability of the same element being re-chosen on the next run. This way you get a heuristic that gives you a more even distribution after a shorter amount of time. But it's not random - well, certain parts are.
You could also randomly select from your database, mark the elements as selected, and from then on select only from those not yet selected. When no elements are left, reset them all.
Very simple, but it might do the job.
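A hedged sketch of that mark-and-reset approach (the entries table and its shown flag are assumptions):

<?php
// Hypothetical implementation of "select only unshown rows, reset when exhausted".
// Assumes a table `entries` with an `id` column and a TINYINT `shown` flag;
// with 50 rows and 5 per page load this divides evenly into 10 rotations.
function pickFive(PDO $pdo): array
{
    $left = (int) $pdo->query('SELECT COUNT(*) FROM entries WHERE shown = 0')->fetchColumn();
    if ($left === 0) {
        $pdo->exec('UPDATE entries SET shown = 0');   // start a new rotation
    }

    $rows = $pdo->query(
        'SELECT * FROM entries WHERE shown = 0 ORDER BY RAND() LIMIT 5'
    )->fetchAll(PDO::FETCH_ASSOC);

    // Mark the picked rows so they are not repeated until the next reset.
    $ids = array_map('intval', array_column($rows, 'id'));
    if ($ids) {
        $pdo->exec('UPDATE entries SET shown = 1 WHERE id IN (' . implode(',', $ids) . ')');
    }
    return $rows;
}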
You can also do something like that with timestamps to make the distribution a bit more elegant.
This could probably look like ORDER BY RAND()*((timestamps-min(timestamps))/(max(timestamps)-min(timestamps))) DESC or something like that. Basically you could normalize the timestamp of selection of an entry over the time interval window so you get something between 0 and 1, and then multiply it by RAND(); then you have 50% "fresh stuff is less likely to be selected" and 50% randomness. I am not sure about the formula above, I just typed it down; it is probably wrong, but the principle works.
I think what you want is generally referred to as "biased randomness". There are a lot of papers on that and some articles on SO, for example here:
Biased random in SQL?
Copy the 50 results to some temporary place (a file, a database table, whatever you use). Then every time you need random values, select 5 random values from the 50 and delete them from your temporary data set.
Once your temporary data set is empty, create a new one by copying the original again.
