Fuzzy date match - php

I have a mysql db of clients and crawled a website retrieving all the reviews for the past few years. Now I am trying to match those reviews up with the clients so I can email them. The problem is that the review site allowed them to enter anything they wanted for the name, so in some cases I have full first name and last initial, and in some cases first initial and last full name. It also gives an approximate time it was posted such as "1 week ago", "6 months ago" and so on which we already have converted to an approximate date.
Now I need to try matching those up to the clients. Seems the best way would be to do a fuzzy search on the names, and then once I find all John B% I look for the one with a job completion date nearest the posting of the review naturally eliminating anything that was posted before jobs were completed.
I put together a small sample dataset where table1 is the clients, table2 is the review to match on here:
http://sqlfiddle.com/#!9/23928c/6/0
I was initially thinking of doing a date_diff, but then I need to sort by the lowest number. Before I tackle this on my own, I thought I would ask if anyone has any tricks they want to share.
I am using PHP / Laravel to query MySql

You can use DATEDIFF with absolute values:
ORDER BY ABS(DATEDIFF(`date`, $calculatedDate)) DESC
To find records that match your estimation closely, positive or negative.

Related

MySql: saving date ranges VS saving single day

I am currently working on a simple booking system and I need to select some ranges and save them to a mysql database.
The problem I am facing is deciding if it's better to save a range, or to save each day separately.
There will be around 500 properties, and each will have from 2 to 5 months booked.
So the client will insert his property and will chose some dates that will be unavailable. The same will happen when someone books a property.
I was thinking of having a separate table for unavailable dates only, so if a property is booked from 10 may to 20 may, instead of having one record (2016-06-10 => 2016-06-20) I will have 10 records, one for each booked day.
I think this is easier to work with when searching between dates, but I am not sure.
Will the performance be noticeable worse ?
Should I save the ranges or single days ?
Thank you
I would advise that all "events" go into one table and they all have a start and end datetime. Use of indexes on these fields is of course recommended.
The reasons are that when you are looking for bookings and available events - you are not selecting from two different tables (or joining them). And storing a full range is much better for the code as you can easily perform the checks within a SQL query and all php code to handle events works as standard for both. If you only store one event type differently to another you'll find loads of "if's" in your code and find it harder to write the SQL.
I run many booking systems at present and have made mistakes in this area before so I know this is good advice - and also a good question.
This is too much for a comment,So I will leave this as an answer
So the table's primary key would be the property_id and the Date of a particular month.
I don't recommend it.Because think of a scenario when u going to apply this logic to 5 or 10 years system,the performance will be worse.You will get approximately 30*12*1= 360 raws for 1 year.Implement a logic to calculate the duration of a booking and add it to table against the user.

Advanced statistics in PHP and MySQL

I have a slight problem. I have a dataset, which contains values measured by a weather station, which I want to analyze further using MySQL database and PHP.
Basically, the first column of the db contains the date and the other columns temperature, humidity, pressure etc.
Now, the problem is, that for the calculation of the mean, st.dev., max, min etc. it is quite simple. However there are no build-in commands for other parameters which I need, such as kurtosis etc.
What I need is for example to calculate the skewness, mean, stdev etc. for the individual months, then days etc.
For the build-in functions it is easy, for example finding some of the parameters for the individual months would be:
SELECT AVG(Temp), STD(Temp), MAX(Temp)
FROM database
GROUP BY YEAR(Date), MONTH(Date)
Obviously I cannot use this for the more advanced parameters. I thought about ways of achieving this and I could only think of one solution. I manually wrote a function, which processes the values and calculates the things such as kurtosis using the particular formulae. But, what that means is that I would need to create arrays of data for each month, day, etc. depending on what I am currently calculating. So for example, i would first need to take the data and split it into arrays lets say Jan11, Feb11, Mar11...... and each array would contain the data for that month. Then I would apply the function on those arrays and create new variables with the result (lets say kurtosis_jan11, kurtosis_feb11 etc.)
Now to my question. I need help with the splitting of data. The problem is that I dont know in advance which month the data starts and which it ends, so I cannot set fixed variables for this. The program first has to check the first month and then create new array for each month, day etc. until it reaches the last record. And for each it would create the array.
That of course would be maybe one solution but if anyone has any other ideas about how to go around this problem I would very much appreciate your help.
You can do more complex queries to achieve this. Here are some examples http://users.drew.edu/skass/sql/ , including Skew
SELECT AVG(Temp), STD(Temp), MAX(Temp)
FROM database
GROUP BY YEAR(Date), MONTH(Date)
having date between date_from and date_to
I think you want a group of data in between a data range.

codeigniter active record querying for results based on hours as well as by state

I have two queries ultimately I think they will be in the same context of the other but in all. I have a user database that I want to pull out for tracking records based on hour. Example registrations per hour. But in this registrations per hour I want to have the query to dump results by hour increments (or weeks, or months) ie: 1,000 regitrations in november, 1,014 in december and so on, or similar for weeks hours.
I also have a similar query where I want to generate a list of states with the counts next to them of how many users I have per state.
My issue is, I'm thinking I think to one dimensionally currently cause the best idea I can think of at the moment is making in the case of the states 50 queries, but I know thats insane, and there has to be an easier way thats less intense. So thats what Im hoping someone from here can help me with, by giving me a general idea. Cause I don't know which is the best course of action for this currently.. be it using distinct, group_by or something else.
Experiment a bit and see if that doesn't help you focus on the question a bit more.
Try selecting from your registrations per hour table and appending the time buckets you are interested in to the select list.
like this:
select userid, regid, date_time, week(date_time), year(date_time), day(date_time)
from registraions;
you can then roll up and count things in that table by using group by and an aggregate function like this:
select count(distinct userid), year(date_time)
from registraions
group by year(date_time)
Read about about date time functions:
MySQL Date Time Functions
Read about aggregate functions"
MySQL Group By

Time Prediction based on existing date:time records

I have a system that logs date:time and it returns results such as:
05.28.2013 11:58pm
05.27.2013 10:20pm
05.26.2013 09:47pm
05.25.2013 07:30pm
05.24.2013 06:24pm
05.23.2013 05:36pm
What I would like to be able to do is have a list of date:time prediction for the next few days - so a person could see when the next event might occur.
Example of prediction results:
06.01.2013 04:06pm
05.31.2013 03:29pm
05.30.2013 01:14pm
Thoughts on how to go about doing time prediction of this kind with php?
The basic answer is "no". Programming tools are not designed to do prediction. Statistical tools are designed for that purpose. You should be thinking more about R, SPSS, SAS, or some other similar tool. Some databases have rudimentary data analysis tools built-in, which is another (often inferior) option.
The standard statistical technique for time-series prediction is called ARIMA analysis (auto-regressive integrated moving average). It is unlikely that you are going to be implementing that in php/SQL. The standard statistical technique for estimating time between events is Poisson regression. It is also highly unlikely that you are going to be implementing that in php/SQL.
I observe that your data points are once per day in the evening. I might guess that this is the end of some process that runs during the day. The end time is based on the start time and the duration of the process.
What can you do? Often a reasonable prediction is "what happened yesterday". You would be surprised at how hard it is to beat this prediction for weather forecasting and for estimating the stock market. Another very reasonable method is the average of historical values.
If you know something about your process, then an average by day of the week can work well. You can also get more sophisticated, and do Monte Carlo estimates, by measuring the average and standard deviation, and then pulling a random value from a statistical distribution. However, the average value would work just as well in your case.
I would suggest that you study a bit about statistics/data mining/predictive analytics before attempting to do any "predictions". At the very least, if you really have a problem in this domain, you should be looking for the right tools to use.
As Gordon Linoff posted, the simple answer is "no", but you can write some code that will give a rough guess on what the next time will be.
I wrote a very basic example on how to do this on my site http://livinglion.com/2013/05/next-occurrence-in-datetime-sequence/
Here's a possible way that this could be done, using PHP + MySQL:
You can have a table with two fields: a DATE field and a TIME field (essentially storing the date + time portion separately). Say that the table is named "timeData" and the fields are:
eventDate: date
eventTime: time
Your primary key would be the combination of eventDate and eventTime, so that they're never repeated as a pair.
Then, you can do a query like:
SELECT eventTime, count(*) as counter FROM timeData GROUP BY eventTime ORDER BY counter DESC LIMIT 0, 10
The aforementioned query will always return the first 10 most frequent event times, ordered by frequency. You can then order these again from smallest to largest.
This way, you can return quite accurate time prediction results, which will become even more accurate as you gather data each day

Twitter style trends with php/mysql

I am coding a social network and I need a way to list the most used trends, All statuses are stored in a content field, so what it is exactly that I need to do is match hashtag mentions such as: #trend1 #trend2 #anothertrend
And sort by them, Is there a way I can do this with MySQL? Or would I have to do this only with PHP?
Thanks in advance
The maths behind trends are somewhat complex; machine learning may be a bit over the top, but you probably need to work through some examples.
If you go with #deadtrunk's sample code, you would miss trends that have fired up in the last half hour; if you go with #eggyal's example, you miss trends that have been going strong all day, but calmed down in the last half hour.
The classic solution to this problem is to use a derivative function (http://en.wikipedia.org/wiki/Derivative); it's worth building a sample database and experimenting with this, and making your solution flexible enough to change this over time.
Whilst you want to build something simple, your users will be used to trends, and will assume it's broken if it doesn't work the way they expect.
You should probably extract the hash tags using PHP code, and then store them in your database separately from the content of the post. This way you'll be able to query them directly, rather then parsing the content every time you sort.
I think it is better to store tags in dedicated table and then perform queries on it.
So if you have a following table layout
trend | date
You'll be able to get trends using following query:
SELECT COUNT(*), trend FROM `trends` WHERE `date` = '2012-05-10' GROUP BY trend
18 test2
7 test3
Create a table that associates hashtags with statuses.
Select all status updates from some recent period - say, the last half hour - joined with the hashtag association table and group by hashtag.
The count in each group is an indication of "trend".

Categories