What's the best way to parse a multi-line log file that requires contextual knowledge from previous lines, in PHP and/or Python?
ex.
Date Time ID Call
1/1/10 00:00:00 1234 Start
1/1/10 00:00:01 1234 ServiceCall A Starts
1/1/10 00:00:05 1234 ServiceCall B Starts
1/1/10 00:00:06 1234 ServiceCall A Finishes
1/1/10 00:00:09 1234 ServiceCall B Finishes
1/1/10 00:00:10 1234 Stop
Each log line will have a unique id to bind it to a session but each consecutive set of lines is not guaranteed to be from the same session.
The ultimate goal is to find out how long each transaction took and how long each sub transaction took.
I'd love to use a library if one already exists.
I can think of two different ways of doing this.
1) You can use a finite state machine to process the file line by line. When you hit a Start line, mark the time. When you hit a Stop line with the same ID, diff the time and report.
2) Use PHP's Perl-Compatible Regular Expressions with the m modifier to match all the text from each start/stop line set, then just look at the first and last lines of each match string returned.
In both cases, I would verify the IDs match to prevent against matching different sets.
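To illustrate option 1, here is a minimal state-machine sketch in PHP; the regex, the strtotime() date handling, and the transactions.log file name are my assumptions based on the sample log above:

<?php
// Minimal sketch of option 1: remember the Start time per session ID and
// report the elapsed time when the matching Stop line appears.
$starts = array();
foreach (file('transactions.log') as $line) {
    // assumed line format: "date time id description" (see the sample log)
    if (!preg_match('#^(\S+ \S+) (\d+) (.+)$#', trim($line), $m)) {
        continue; // skip the header and blank lines
    }
    list(, $timestamp, $id, $descr) = $m;
    $t = strtotime($timestamp);
    if ($descr === 'Start') {
        $starts[$id] = $t;
    } elseif ($descr === 'Stop' && isset($starts[$id])) {
        printf("session %s took %d seconds\n", $id, $t - $starts[$id]);
        unset($starts[$id]); // session complete; free the state
    }
}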
My first thought would be to create an object each time my parser encounters the start pattern with a new key. I'm assuming, from your example, that 1234 is a key such that all log lines which must be correlated together can be mapped to the state of one "thing" (object).
When you see the start pattern you begin tracking one of these, and every time you see a log entry that relates to it you call a method for the type of event (state change) that the line represents.
From your example these "log state" objects (for lack of a more apropos term) might contain a list or dictionary (or other container) for each ServiceCall (which I would expect would be another class of objects).
So the overall design would be a parser/dispatcher that reads the log; if a log item relates to an existing object (key), the item is dispatched to that object, which can then create its own (ServiceCall or other) objects, dispatch events to them, raise exceptions, invoke callbacks, or call out to other functions as needed.
Presumably you will also need some collection or final-disposition handler that your log objects can call when the Stop events are dispatched to them.
I'd guess you'd also want to support some sort of status-reporting method so that the application can enumerate all live (uncollected) objects in response to signals or commands on some other channel (perhaps from a non-blocking check performed by the parser/dispatcher).
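A bare-bones PHP sketch of this design; every class, method, and file name here is illustrative rather than taken from an existing library:

<?php
// Parser/dispatcher sketch: one Session object per live key; the Stop event
// triggers the final disposition (here, just a report) and collection.
class Session {
    private $id;
    private $events = array(); // list of (timestamp, description) pairs
    public function __construct($id) { $this->id = $id; }
    public function handleEvent($ts, $descr) {
        $this->events[] = array($ts, $descr);
        // a fuller version would create/update ServiceCall objects here
    }
    public function duration() {
        $last = $this->events[count($this->events) - 1];
        return $last[0] - $this->events[0][0];
    }
}

$sessions = array(); // live (uncollected) sessions, keyed by ID
foreach (file('transactions.log') as $line) {
    if (!preg_match('#^(\S+ \S+) (\d+) (.+)$#', trim($line), $m)) continue;
    list(, $ts, $id, $descr) = $m;
    if (!isset($sessions[$id])) $sessions[$id] = new Session($id);
    $sessions[$id]->handleEvent(strtotime($ts), $descr);
    if ($descr === 'Stop') { // final disposition, then collect
        printf("session %s: %d seconds\n", $id, $sessions[$id]->duration());
        unset($sessions[$id]);
    }
}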
Here is a variation on a log parser I wrote a while ago, tailored to your log format. (The general approach tracks pretty closely with Jim Dennis's description, although I used a defaultdict of lists to accumulate all the entries for any given session.)
from pyparsing import Suppress, Word, nums, restOfLine
from datetime import datetime
from collections import defaultdict

def convertToDateTime(tokens):
    month, day, year, hh, mm, ss = tokens
    return datetime(year + 2000, month, day, hh, mm, ss)

# define building blocks for parsing and processing log file entries
SLASH, COLON = map(Suppress, "/:")
integer = Word(nums).setParseAction(lambda t: int(t[0]))
date = integer + (SLASH + integer) * 2
time = integer + (COLON + integer) * 2
timestamp = date + time
timestamp.setParseAction(convertToDateTime)

# define format of a single line in the log file
logEntry = timestamp("timestamp") + integer("sessionid") + restOfLine("descr")

# summarize calls into a single data structure
# ("log" is an iterable of log lines, e.g. an open file object)
calls = defaultdict(list)
for logline in log:
    entry = logEntry.parseString(logline)
    calls[entry.sessionid].append(entry)

# first pass to find start/end time for each call
for sessionid in sorted(calls):
    calldata = calls[sessionid]
    print(sessionid, calldata[-1].timestamp - calldata[0].timestamp)
For your data, this prints out:
1234 0:00:10
You can process each session's list of entries with a similar approach to tease apart the sub-transactions.
In my MySQL table I have the field name, which should be unique. However, the contents of the field are gathered from different places, so it is possible to end up with two records whose names are very similar (due to spelling errors) instead of the second one being discarded.
Now I want to find those entries that are very similar to one another. To do that, I loop through all my records and compare each name to the other entries by looping through all the records again. The problem is that there are over 15k records, which takes far too much time. Is there a way to do this faster?
this is my code:
for ($x = 0; $x < count($serie1); $x++) {
    for ($y = 0; $y < count($serie2); $y++) {
        $sim = levenshtein($serie1[$x]['naam'], $serie2[$y]['naam']);
        if ($sim == 1) {
            print("{$serie1[$x]['naam']} --> {$serie2[$y]['naam']} = {$sim}<br>");
        }
    }
}
A preamble: such a task will always be time consuming, and there will always be some pairs that slip through.
Nevertheless, a few ideas :
1. actually, the algorithm can be (a bit) improved
assuming that $serie1 and $serie2 contain the same values in the same order, you don't need to loop over the whole second array in the inner loop every time. In this use case each value pair only needs to be evaluated once: levenshtein('a', 'b') is sufficient, you don't need levenshtein('b', 'a') as well (and you don't need levenshtein('a', 'a') at all)
under these assumptions, you can write your function like this:
for ($x = 0; $x < count($serie1); $x++) {
    for ($y = $x + 1; $y < count($serie2); $y++) { // <-- $y doesn't need to start at 0
        $sim = levenshtein($serie1[$x]['naam'], $serie2[$y]['naam']);
        if ($sim == 1) {
            print("{$serie1[$x]['naam']} --> {$serie2[$y]['naam']} = {$sim}<br>");
        }
    }
}
2. maybe MySQL is faster
there are examples on the net of levenshtein() implementations as a MySQL function. An example on SO is here: How to add levenshtein function in mysql?
If you are comfortable with complex(ish) SQL, you could delegate the heavy lifting to MySQL and at least gain a bit of performance because you aren't fetching the whole 16k rows into the PHP runtime.
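For instance, assuming a levenshtein() SQL function has been created as described in that link, a self-join keeps the whole comparison inside MySQL; the serie table and its id column are made-up names:

<?php
// Sketch: delegate the pairwise comparison to MySQL. Assumes a levenshtein()
// SQL function exists (see the linked question); table/column names made up.
// The a.id < b.id join condition evaluates each pair only once.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$sql = 'SELECT a.naam AS naam_a, b.naam AS naam_b
        FROM serie a
        JOIN serie b ON a.id < b.id
        WHERE levenshtein(a.naam, b.naam) = 1';
foreach ($pdo->query($sql) as $row) {
    print("{$row['naam_a']} --> {$row['naam_b']}<br>");
}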
3. don't do everything at once / save your results
of course you have to run the function once for every record, but after the initial run you only have to check entries added since the last run. Schedule a cron job that checks all new records once every day/week/month. You would need an inserted_at column in your table, and you would still need to compare the new names with every existing name entry.
3.5 do some of the work onInsert
a) if the wait is acceptable, run the check whenever a new record is about to be inserted, and either write the result to a log or give direct feedback to the user. (A tangent: this could be a good use case for an asynchronous task queue like http://gearman.org/ -> start a new process for the check in the background and return with the success message for the insert immediately)
b) PHP has two other functions to help with searching for almost-similar strings: metaphone() and soundex(). These functions generate abstract hashes that represent how a string sounds when spoken. You could generate one or both of these hashes on each insert, store them as separate fields in your table, and use simple SQL to find records with matching hashes.
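A sketch of idea b); the table and column names are made up:

<?php
// Sketch of idea b): store phonetic hashes next to each name on insert, then
// use a cheap indexed equality lookup instead of an O(n^2) levenshtein scan.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$naam = 'Jansen';
$stmt = $pdo->prepare(
    'INSERT INTO serie (naam, naam_metaphone, naam_soundex) VALUES (?, ?, ?)');
$stmt->execute(array($naam, metaphone($naam), soundex($naam)));

// Later: find existing names that sound like the new one
$stmt = $pdo->prepare(
    'SELECT naam FROM serie WHERE naam_soundex = ? AND naam <> ?');
$stmt->execute(array(soundex($naam), $naam));
foreach ($stmt as $row) {
    print("{$row['naam']} sounds like {$naam}<br>");
}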
The trouble with levenshtein is that it only compares string a to string b. I built a spelling corrector once that put all the strings a into a big trie, which functioned as a dictionary. Then it would look up any string b in that dictionary, finding all nearest-matching words. I did it first in Fortran (!), then in Pascal. It would be easiest in a more modern language, but I suspect PHP would not make it easy. Look here.
I'm using Mediawiki, with the SMW extension, in a private setting to organize a fictional universe my group is creating. I have some functionality I'd like, and I'd like to know if there is an extension out there, or the possibility of making my own, that would let me do what I want to do.
Basically, in plain English: I want to know what's going on in the universe at a specific point (or duration) in time.
I'd like to give the user the ability to enter a date (as simple as a year, or as precise as necessary), or a duration, and return every event whose duration overlaps it.
For example.
Event 1 starts ~ 3200 BCE, and ends ~ 198 BCE
Event 2 starts ~509 BCE and ends ~ 405 CE
Event 3 starts 1/15/419 CE and ends 1/17/419 CE
Event 4 starts ~409 BCE and ends on 2/14/2021 CE
User inputs a date (partial, in this instance) 309 BCE.
Wiki returns Event 1, and Event 4, as the given date is within the duration of both.
This would allow my creators to query a specific date (and hopefully a duration) and discover what events are already taking place, so they can adjust their works according to what is already established. It's a simple conflict checker.
If there's no extension available that can handle this, is there anything like this anywhere I can research? I've never dealt with dates in PHP. I'm a general business coder, I've never done complex applications.
There is no built in “duration” data type in SMW, so the easiest approach would probably be to use one date property for the starting date, and one for the ending date (note that it must be BC/AD, not BCE/CE or similar):
[[Event starts at::3200 BC]]
[[Event ends at::198 BC]]
then you can query for each event that has a starting date before, and an ending date after, a certain date:
{{#ask:[[Event starts at::<1000 BC]] [[Event ends at::>1000 BC]]}}
Note that > actually means “greater or equal to” in SMW query syntax.
You have a sorted set A in Redis; every now and then you add new elements to it, and they get sorted by rank. You also have a sorted set B.
Is there a way to check if there are elements in set A that have been there for more than, say, 20 seconds, and move them to sorted set B?
Because this checking operation is done very frequently and the set can be very big, iterating through every element in the set is not a good solution. I need the fastest one.
Thanks.
UPDATE:
Here is what I was trying to do:
Basically the idea was: imagine you have some kind of game server that matches opponents when they submit a fight request. The current design was that every request goes into the set, and the rank/score is the player's rank, so any two players that are near each other in the list are a perfect match. Every 5 seconds or so a script gets called that pulls 50 rows from the top of the set and matches them two by two (and removes them). This was working fine, and I think it was a very fast solution. But then came the idea of creating bot (AI) players, so that when a player has been waiting too long in the queue, he gets matched with a bot. And I cannot figure out a way to see "who is waiting too long". Basically maybe the entire idea was wrong, so any better ideas are welcome :) Thanks a lot.
If the score in your sorted set is a Unix timestamp, you can use zrange to grab the oldest N items from set A. You can then do your checks, add qualifying entries to set B, and remove them from set A.
If your scoring in set A is not based on timestamp, then you will have to iterate over your set A entirely, or rethink your design. Redis keys do not have an inherent available timestamp of when they are added (which holds doubly true for items in a key such as a sorted set), so it has to be something you specifically create and track. Perhaps if you share more about what you are doing and why we can help with more detail.
Edit:
based on the additions to your question, I would recommend trying a solution similar to what @akonsu is proposing.
Specifically:
Sorted-Set-A: players by rank just as they are now.
Sorted-Set-B: uses the timestamp at which the person went into the queue as the score, and stores their user ID. In other words, when you zadd to SetA with their rank and ID, you also zadd to SetB with the timestamp and ID.
When you match players you remove them from both sets. If you pull your set of users to match from SetB using a zrange command to grab the X oldest entries, you will have the times they queued up, in order of entry (like a FIFO). You then use a zrangebyscore command on SetA with their rank +/- whatever rank range you need. If you get a match you proceed with it, removing both players from both sets and moving on.
If there is no suitable opponent in SetA and their timestamp is old enough, you match them with an AI, then remove them from both sets and move on.
Essentially it is an indexed queue of users->timestamp. Doing it this way means shorter queue times for all users, as you are now matching them in order of queue length. You still use SetA for matching based on players' rank, but now you are indexing and prioritizing based on time.
The specific implementation may be a bit more involved than this, but as an overall strategy I think this fits what you need.
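A rough sketch of that flow using phpredis; the key names, rank window, batch size, and 20-second bot timeout are all illustrative, and startMatch() stands in for a hypothetical game-server call:

<?php
// Two-set matchmaking sketch (phpredis).
// 'ranks' (SetA): score = player rank; 'queue' (SetB): score = enqueue time.
$r = new Redis();
$r->connect('127.0.0.1');

function enqueue($r, $userId, $rank) {
    $r->zAdd('ranks', $rank, $userId);
    $r->zAdd('queue', time(), $userId);
}

function matchPlayers($r) {
    foreach ($r->zRange('queue', 0, 49) as $userId) { // 50 oldest waiters
        $rank = $r->zScore('ranks', $userId);
        if ($rank === false) continue; // already matched and removed
        // opponents within +/- 100 rank points, excluding the player himself
        $candidates = array_diff(
            $r->zRangeByScore('ranks', $rank - 100, $rank + 100),
            array($userId));
        if ($candidates) {
            $opponent = reset($candidates);
            startMatch($userId, $opponent);
            $r->zRem('ranks', $userId, $opponent);
            $r->zRem('queue', $userId, $opponent);
        } elseif (time() - $r->zScore('queue', $userId) > 20) {
            startMatch($userId, 'bot'); // waited too long: match with an AI
            $r->zRem('ranks', $userId);
            $r->zRem('queue', $userId);
        }
    }
}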
I'm trying to merge multiple iCalendars together, and I want to be able to merge overlapping events. So for example, if I have an event Monday at 12pm - 2pm and another event at 1pm - 3pm, I want to end up with a single event that runs from 12pm until 3pm.
I'm looking for a simple open-source script that does that in PHP, or just help with the algorithm itself.
Any kind of help is appreciated!
Right -- sadly, I cannot help you with the PHP coding, as I know nothing of PHP (this also means my algorithmic help might just be way off). However, I know my way around algorithms quite well, so I'm going to come up with as many as possible. I'll give each its reasons for and against, you can take your pick, and hopefully we'll both learn something.
First off, a simplification: note that when merging more than two iCalendars together, we can merge two, then merge the result with the next, etc.; meaning our algorithm only needs to merge two.
With that in mind, the conceptually simplest merge I can muster:
Given iCalendars A and B, we will merge them into a new iCalendar C:
Initialize C.
Pick & remove the earliest event from either A or B, adding it to C.
Do the same, this time "merging the events" if they overlap.
Lather, rinse, repeat until both A and B are empty; C now contains the merger of A and B.
Actually, this is close to the best possible algorithm: O(n) time, where n is the average number of events per iCalendar; meaning no other methods will be forthcoming... sadly.
This is what I ended up doing if anyone's interested. It may not be the most efficient, but it's good enough for what I'm doing.
1. Parse the calendars into Event objects (each object has start and end times as Unix timestamps); the Event class should also have a toString() method for exporting.
2. Store all the objects in an array, then sort it by start time (ascending).
3. Initialize an array for the final result; let's call it "final_array".
4. Take the first Event in the array as "A".
5. Start iterating through the array starting with the next Event; let's name it "B".
6. If B starts after A ends: add A to final_array and make B the new A.
   If B starts before A ends:
   If B ends before A ends: do nothing.
   If B ends after A ends: change A's end time to B's end time.
7. Go back to 5 if you haven't reached the end of the array; once you have, add the current A to final_array as well.
8. For each event in final_array, write the event to the new calendar file.
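In PHP, those steps might look something like this sketch, with the Event class from step 1 stripped down to just the two timestamps:

<?php
// Sketch of the steps above: sort by start time, then sweep once, growing
// the current event A until a gap appears.
class Event {
    public $start, $end; // Unix timestamps
    public function __construct($start, $end) {
        $this->start = $start;
        $this->end = $end;
    }
}

function mergeEvents(array $events) {
    usort($events, function ($x, $y) {      // step 2: sort ascending by start
        return $x->start - $y->start;
    });
    $final_array = array();                  // step 3
    $a = array_shift($events);               // step 4
    foreach ($events as $b) {                // steps 5-7
        if ($b->start > $a->end) {           // gap: A is finished
            $final_array[] = $a;
            $a = $b;
        } elseif ($b->end > $a->end) {       // overlap that extends A
            $a->end = $b->end;
        }                                    // else B lies inside A: do nothing
    }
    $final_array[] = $a;                     // don't forget the last A
    return $final_array;
}

// Example: 12pm-2pm and 1pm-3pm merge into a single 12pm-3pm event
$merged = mergeEvents(array(
    new Event(strtotime('2012-01-02 12:00'), strtotime('2012-01-02 14:00')),
    new Event(strtotime('2012-01-02 13:00'), strtotime('2012-01-02 15:00')),
));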
So recently I was given a problem, which I have been mulling over and am still unable to solve; I was wondering if anyone here could point me in the right direction by providing me with the pseudo code (or at least a rough outline of it) for this problem. PS: I'll be building in PHP, if that makes a difference...
Specs
There are ~50 people (for this example I'll just call them a, b, c...) and the user is going to group them into groups of three (people in the groups may overlap), and in the end there will be 50-100 groups (i.e. {a,b,c}; {d,e,f}; {a,d,f}; {b,c,l}...).
So far it is easy: it is a matter of building an HTML form and processing it into a multidimensional array.
There are ~15 time slots during the day (eg 9:00AM, 9:20AM, 9:40AM...). Each of these groups needs to meet once during the day. And during one time slot the person cannot be double booked (ie 'a' cannot be in 2 different groups at 9:40AM).
It gets tricky here, but it's not impossible; my best guess at how to do this would be to brute-force it (pick out sets of groups that have no overlap, e.g. {a,b,c}; {l,f,g}; {q,n,d}..., and then put each set into a time slot).
Finally, the schedule which I output needs to be 'optimized', by that I mean that 'a' should have minimal time between meetings (so if his first meeting is at 9:20AM, his second meeting shouldn't be at 2:00PM).
Here's where I am lost; my only guess would be to build many, many schedules and then rank them based on the average waiting time a person has from one meeting to the next.
However, my 'solutions' (I hesitate to call them that) require too much brute force and would take too long to run. Are there simpler, more elegant solutions?
These are the tables laid out, modified for your scenario:
+----User_Details------+ //You may or may not need this
| UID | Particulars... |
+----------------------+
+----User_Timeslots---------+ //Time slots per column
| UID | SlotNumber(bool)... | //true/false if the user is available
+---------------------------+ //SlotNumber is replaced by s1, s2, etc
+----User_Arrangements--------+ //Time slots per column
| UID | SlotNumber(string)... | //Group session string
+-----------------------------+
Note that the string in the Arrangements table is in the following format (JSON):
'[12,15,32]' //From SMALLEST to BIGGEST!
So what happens in the Arrangements table is that a script [or an Excel column formula] goes through each slot per session and randomly creates a possible session, checking all previous sessions for conflicts.
/**
 * Randomise a session, in which data is not yet set
 **/
function randomizeSession( sessionID ) {
    for( var id = [lowest UID]; id <= [highest UID]; id++ ) {
        if( id exists ) {
            randomizeSingleSession( id, sessionID );
        } //else skip
    }
}

/**
 * Randomizes a single user in a session, without conflicts with previous sessions
 **/
function randomizeSingleSession( id, sessionID ) {
    convert sessionID to its column name =)
    get the column names of all the previous sessions
    if( there is data already (false, or JSON) ) {
        do nothing (already has data)
    }
    if( ID is available in the time-slot table (for this session) ) {
        get all IDs who are available and contain no data this session
        get all the UIDs' previous sessions
        while( first time || not yet resolved ) {
            randomly choose 2
            if( there was a conflict in the UIDs' previous sessions ) {
                try again (while) : not yet resolved
            } else {
                resolved
            }
        }
        register all 3 users as a group in the session
    } else {
        set session result to false (no attendance)
    }
}
You will realize that the main part of the group assignment is done via randomization. However, as the number of sessions increases, there will be more and more data to check against for conflicts, resulting in much slower performance. It does, however, hold up even for ridiculously large inputs, approaching an almost perfect permutation/combination formulation.
EDIT:
This setup will also help ensure that, as long as a user is available, they will be in a group. You may still end up with small pockets of users having no group; these are usually remedied by recalculating (for small session numbers), or by just manually grouping them together, even if it creates a repeat (having a few here and there does not hurt). Or alternatively, in your case, join several leftover users onto existing groups of 3 to form groups of 4. =)
And if this can work in Excel with about 100+ people and about 10 sessions, I do not see how it would not work in SQL + PHP. Just note that the calculations may take considerable time either way.
Okay, for those who are just joining in on this post: please read through all the comments on the question before considering the contents of this answer, as it will otherwise very likely fly over your head.
Here is some pseudo code in PHP'ish style:
/* Array with profs (this is one-dimensional here for show, but I assume
   it will be multi-dimensional, filled with availability and what not;
   for the sake of this example, let me say that the multi-dimensional array
   contains the following keys: [id]{[avail_from],[avail_to],[last_ses],[name]} */
$profs = array_fill(0, $prof_num, "assoc_ids");
// Array with time slots, let's say UNIX stamps of begin time
$times = array_fill(0, $slot_num, "time");
// First, we need to loop through all the time slots
foreach ($times as $slot) {
    // See when the session ends
    $slot_end = $slot + $session_time;
    // Now, run through the profs to see who's available
    $avail_profs = array(); // Empty
    foreach ($profs as $prof_id => $data) {
        if (($data['avail_from'] <= $slot) && ($data['avail_to'] >= $slot_end)) {
            $avail_profs[$prof_id] = $data['last_ses'];
        }
    }
    /* Reverse-sort the array so that the highest numbers (profs who have been
       waiting the longest) are at the top */
    arsort($avail_profs);
    $profs_session = array_slice($avail_profs, 0, 3, true); // keep the IDs as keys
    $profs_session_names = array(); // Empty
    // Reset the last_ses counters on those profs
    foreach ($profs_session as $prof_id => $last_ses) {
        $profs[$prof_id]['last_ses'] = 0;
        $profs_session_names[] = $profs[$prof_id]['name'];
    }
    // Now, loop through all profs to add one to their waiting time
    foreach ($profs as $prof_id => $data) {
        $profs[$prof_id]['last_ses']++;
    }
    print(sprintf('The %s session will be held by: %s, %s, and %s<br />', $slot,
                  $profs_session_names[0], $profs_session_names[1],
                  $profs_session_names[2]));
    unset($profs_session, $profs_session_names, $avail_profs);
}
That should print something like:
The 9:40am session will be held by: C. Hicks, A. Hole, and B.E.N. Dover
I see an object model consisting of:
Panelists: a fixed repository of your panelists (Tom, Dick, Harry, etc.)
Panel: consists of X Panelists (X=3 in your case)
Timeslots: a fixed repository of your time slots. Assuming a fixed duration and meetings only occurring on a single day, all you need to track is the start time.
Meeting: consists of a Panel and Timeslot
Schedule: consists of many Meetings
Now, as you have observed, the optimization is the key. To me the question is: "Optimized with respect to what criteria?". Optimal for Tom might mean that the Panels he is a member of lay out without big gaps, while Harry's Panels may be all over the board. So, perhaps for a given Schedule, we compute something like totalMemberDeadTime (= the sum of all members' dead-time gaps in the Schedule). An optimal Schedule is one that is minimal with respect to this sum.
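As a sketch, assuming each panelist's meetings are reduced to a sorted list of start timestamps and all slots share one fixed length (every name here is made up):

<?php
// totalMemberDeadTime sketch: $schedule maps each panelist to a sorted array
// of meeting start times (Unix timestamps); $slot_len is the fixed meeting
// length in seconds.
function totalMemberDeadTime(array $schedule, $slot_len) {
    $total = 0;
    foreach ($schedule as $panelist => $starts) {
        for ($i = 1; $i < count($starts); $i++) {
            // gap between the end of one meeting and the start of the next
            $total += $starts[$i] - ($starts[$i - 1] + $slot_len);
        }
    }
    return $total;
}
// An optimal Schedule is one minimizing this sum across candidate Schedules.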
If we are interested in computing a technically optimal schedule among the universe of all schedules, I don't really see an alternative to brute force.
Perhaps that universe of Schedules does not need to be as big as it might first appear. It sounds like the panels are constituted first, and then the issue is to assign them to Meetings, which then constitute a Schedule. So we have removed the variability in the panel composition; the full scope of variability is in the Meetings and the Schedule. Still, it sure seems like a lot of variability there.
But perhaps optimal with respect to all possible Schedules is more than we really need.
Might we define a Schedule as acceptable if no panelist has total dead time of more than X? Or failing that, if no more than N panelists have dead time of more than X (we can't satisfy everyone, but we keep the screwing-down to a minimum)? Then the user could assign meetings for panels containing the more "important" panelists first, and the less important guys simply have to take what they get. Then all we have to do is find a single acceptable Schedule.
Might it be sufficient for your purposes to compare any two Schedules? Combined with an interface (I'm seeing drag-and-drop, but that's clearly beside the point) that allows the user to constitute a schedule, clone it into a second schedule, and tweak the second one, looking to reduce aggregate dead time, until we find one that is acceptable.
Anyway, not a complete answer. Just thinking out loud. Hope it helps.