Search Query Parser - php

My app is supposed to search stored images based on a search query. The user can search by label, description, people tagged, and time posted. For that I am trying to write a search query parser that accepts a wildcard (*), #TaggedPeopleName, (DateFrom - DateTo), #Place, and any other text to match against the label and description. My question is: am I reinventing the wheel, or does a parser with similar functionality already exist?
Example Queries are:
#JohnLenon 500 Miles
will return the images that match 500 Miles in the label or description and have a tag of John Lenon
(24 Dec - 30 Dec)
will return all images uploaded in that time frame.
#Kolkata (24 dec - 31 Dec) Occupy Together
will return all images that match the string Occupy Together in the label or description, fall within the time frame 24 Dec to 31 Dec, and were taken at the place Kolkata
If some library already does this, even with a different syntax, I'll accept it, as I am not tied to this syntax.

To my knowledge, there is nothing that does this automatically for you - it's way too particular to your situation.
I would break it down in chunks to make it easy.
Search for all terms starting with # - remove them.
Search for all terms surrounded by () - remove them.
What's left is the general search term.
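The steps above can be sketched in PHP roughly as follows. This is a minimal sketch, not a finished parser: the function name parseSearchQuery, the returned array keys, and the rule that a date range is two values separated by " - " are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: splits a query into #tags, (from - to) ranges and free text.
function parseSearchQuery(string $query): array
{
    $tags = [];
    $ranges = [];

    // Step 1: collect every "#Term", then remove those terms from the query.
    if (preg_match_all('/#(\w+)/', $query, $m)) {
        $tags = $m[1];
        $query = preg_replace('/#\w+/', ' ', $query);
    }

    // Step 2: collect every "( ... )" group, then remove those groups.
    if (preg_match_all('/\(([^()]*)\)/', $query, $m)) {
        foreach ($m[1] as $inner) {
            // Assumption: a range is "from - to" with whitespace around the hyphen.
            $parts = preg_split('/\s+-\s+/', trim($inner), 2);
            if (count($parts) === 2) {
                $ranges[] = ['from' => $parts[0], 'to' => $parts[1]];
            }
            // Anything else between parentheses is treated as junk and dropped.
        }
        $query = preg_replace('/\([^()]*\)/', ' ', $query);
    }

    // Step 3: what is left is the general search text.
    $text = trim(preg_replace('/\s+/', ' ', $query));

    return ['tags' => $tags, 'ranges' => $ranges, 'text' => $text];
}

$result = parseSearchQuery('#Kolkata (24 dec - 31 Dec) Occupy Together');
print_r($result);
```

The date strings are returned unparsed on purpose; turning "24 dec" into an actual date (and deciding which year is meant) is a separate problem.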
Things to think about:
What if someone wants to search for a term in a description that starts with #?
What is the format of the () terms? Most people won't naturally format dates the way you have.
What if someone just puts in junk between the ()?
What if two words are separated by another token, but after removing them are put together?
Or what if the person puts two words together but doesn't want them searched for as one term?

Related

Fuzzy date match

I have a mysql db of clients and crawled a website retrieving all the reviews for the past few years. Now I am trying to match those reviews up with the clients so I can email them. The problem is that the review site allowed them to enter anything they wanted for the name, so in some cases I have full first name and last initial, and in some cases first initial and last full name. It also gives an approximate time it was posted such as "1 week ago", "6 months ago" and so on which we already have converted to an approximate date.
Now I need to match those up to the clients. The best way seems to be a fuzzy search on the names; once I find all John B%, I look for the one with a job completion date nearest the posting date of the review, eliminating anything that was posted before the job was completed.
I put together a small sample dataset where table1 is the clients, table2 is the review to match on here:
http://sqlfiddle.com/#!9/23928c/6/0
I was initially thinking of doing a date_diff, but then I need to sort by the lowest number. Before I tackle this on my own, I thought I would ask if anyone has any tricks they want to share.
I am using PHP / Laravel to query MySQL.
You can use DATEDIFF with absolute values:
ORDER BY ABS(DATEDIFF(`date`, $calculatedDate)) ASC
This sorts the records closest to your estimated date first, whether they fall before or after it.
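The same "smallest absolute difference first" idea can be tried out in plain PHP before writing the query. This is a sketch under assumptions: the function name closestClient, the array keys name and completed, and the sample rows are invented for illustration and are not taken from the linked fiddle.

```php
<?php
// Hypothetical helper: returns the candidate whose completion date is closest
// to the estimated review date, skipping jobs completed after the review.
function closestClient(array $candidates, string $reviewDate): ?array
{
    $review = new DateTimeImmutable($reviewDate);

    // A review should not precede the job it describes.
    $eligible = array_filter(
        $candidates,
        fn(array $c): bool => new DateTimeImmutable($c['completed']) <= $review
    );
    if ($eligible === []) {
        return null;
    }

    // Sort by absolute difference in days, smallest first.
    usort($eligible, function (array $a, array $b) use ($review): int {
        $da = $review->diff(new DateTimeImmutable($a['completed']))->days;
        $db = $review->diff(new DateTimeImmutable($b['completed']))->days;
        return $da <=> $db;
    });

    return $eligible[0];
}

// Made-up sample data for illustration only.
$candidates = [
    ['name' => 'John Brown', 'completed' => '2016-01-05'],
    ['name' => 'John Baker', 'completed' => '2016-03-01'],
    ['name' => 'John Bell',  'completed' => '2016-03-20'],
];
$best = closestClient($candidates, '2016-03-10');
print_r($best);
```

Because the review dates are only approximate, the strict "not after the review" filter may be too harsh; allowing a tolerance of a few weeks is a design choice to consider.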

Extract an 8 character integer string from an ical file

This code to get all sequences of 8 integers works fine:
preg_match_all('/[0-9]{8}/', $string, $match);
However I am only interested if the match starts with 20.
I know I have to add ^20 somewhere but I have tried many times with no success. I have looked at many regex tutorials but none of them seems to explain how to do 2 separate searches.
I am actually trying to parse ICAL files to extract the dates. If the 8 digit integer starts with 20 it almost certainly is a date.
For example: DTSTART:20150112T120000Z
How about this solution:
/(20)\d{6}/
This will probably find what you are looking for:
(?=20)(\d{8})
It uses a positive lookahead to require that the match starts with 20, then captures the 8-digit number.
The answer depends heavily on what you want to achieve. Do you want to extract any and all dates from an iCalendar file? If so, you might miss birthdays, as their years most likely start with 19xx.
Also, matching any date will most likely yield many undesired dates, such as UNTIL, TRIGGER, DTEND, ...
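A short sketch of how such a pattern can be used with preg_match_all. The lookbehind and lookahead around the digits are an addition not present in the answers above: they stop the pattern from matching eight digits that sit inside a longer number. The sample iCalendar text, including the UID value, is made up.

```php
<?php
// Made-up sample; the UID contains "20150112" inside a longer digit run.
$ical = "DTSTART:20150112T120000Z\nDTEND:20150112T130000Z\nUID:9920150112345\n";

// Plain version: also matches inside the UID.
preg_match_all('/20\d{6}/', $ical, $loose);

// Stricter version: the 8 digits must not touch another digit on either side.
preg_match_all('/(?<!\d)20\d{6}(?!\d)/', $ical, $strict);

print_r($loose[0]);
print_r($strict[0]);
```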
Assuming from your example you want to extract events start dates, you could try:
DTSTART[a-zA-Z._%+-/=;]*:(\d{8})(?:T\d{6})?
Keep in mind that DTSTART can be followed by a timezone definition like TZID=America/New_York and/or the value type DATE or DATE-TIME (see RFC 5545, DATE-TIME).
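A sketch of pulling only event start dates out of an iCalendar string. It assumes unfolded lines and parameter values without quoted colons, which RFC 5545 does allow, so treat it as a starting point. The sample events are invented.

```php
<?php
// Made-up sample covering the three common DTSTART shapes.
$ical = implode("\r\n", [
    'BEGIN:VEVENT',
    'DTSTART;TZID=America/New_York:20150112T120000',
    'DTEND;TZID=America/New_York:20150112T130000',
    'END:VEVENT',
    'BEGIN:VEVENT',
    'DTSTART;VALUE=DATE:20150214',
    'END:VEVENT',
    'BEGIN:VEVENT',
    'DTSTART:20150301T090000Z',
    'END:VEVENT',
]);

// Optional ";param=value" part, then the date, then an optional time.
$pattern = '/^DTSTART(?:;[^:\r\n]*)?:(\d{8})(?:T(\d{6})Z?)?/m';
preg_match_all($pattern, $ical, $matches, PREG_SET_ORDER);

// Keep only the captured 8-digit date of each match.
$startDates = array_map(fn(array $m): string => $m[1], $matches);
print_r($startDates);
```

Anchoring on the property name at the start of a line avoids picking up DTEND, UNTIL, or TRIGGER values.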

Searchable date/time durations in SMW

I'm using Mediawiki, with the SMW extension, in a private setting to organize a fictional universe my group is creating. I have some functionality I'd like, and I'd like to know if there is an extension out there, or the possibility of making my own, that would let me do what I want to do.
Basically, in plain English: I want to know what's going on in the universe at a specific point (or duration) in time.
I'd like to give the user an ability to give a date (as simple as a year, or as precise as necessary), or duration, and return every event which has a duration that overlaps.
For example.
Event 1 starts ~ 3200 BCE, and ends ~ 198 BCE
Event 2 starts ~509 BCE and ends ~ 405 CE
Event 3 starts 1/15/419 CE and ends 1/17/419 CE
Event 4 starts ~409 BCE and ends on 2/14/2021 CE
User inputs a date (partial, in this instance) 309 BCE.
Wiki returns Event 1, Event 2, and Event 4, as the given date is within the duration of each.
This would allow my creators to query a specific date (and hopefully a duration) and discover what events are already taking place, so they can adjust their works according to what is already established. It's a simple conflict checker.
If there's no extension available that can handle this, is there anything like this anywhere I can research? I've never dealt with dates in PHP. I'm a general business coder, I've never done complex applications.
There is no built in “duration” data type in SMW, so the easiest approach would probably be to use one date property for the starting date, and one for the ending date (note that it must be BC/AD, not BCE/CE or similar):
[[Event starts at::3200 BC]]
[[Event ends at::198 BC]]
then you can query for each event that has a starting date before, and an ending date after a certain date:
{{#ask:[[Event starts at::<1000 BC]] [[Event ends at::>1000 BC]]}}
Note that > actually means “greater or equal to” in SMW query syntax.
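The rule behind that query (an event matches when it starts on or before the end of the queried period and ends on or after its start) can be checked outside the wiki with a few lines of PHP. This sketch assumes year precision only and encodes BC years as negative integers; the function name overlapping is hypothetical, and the event data comes from the question.

```php
<?php
// Years as signed integers: BC negative, AD positive (year precision only).
$events = [
    'Event 1' => [-3200, -198],
    'Event 2' => [-509, 405],
    'Event 3' => [419, 419],
    'Event 4' => [-409, 2021],
];

// Hypothetical helper: names of events overlapping the period [$from, $to].
function overlapping(array $events, int $from, int $to): array
{
    return array_keys(array_filter(
        $events,
        fn(array $e): bool => $e[0] <= $to && $e[1] >= $from
    ));
}

// A single date is a period whose start and end are equal.
// Note that Event 2 (509 BC to 405 AD) also spans 309 BC.
print_r(overlapping($events, -309, -309));

// A duration query: 410 AD to 420 AD.
print_r(overlapping($events, 410, 420));
```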

Parsing input from user in any order or format

I am having some trouble trying to figure out how to parse information collected from user. The information I am collecting is:
Age
Sex
Zip Code
Following are some examples of how I may receive this from users:
30 Male 90250
30/M/90250
30 M 90250
M 30 90250
30-M-90250
90250,M,30
I started off with the explode function, but I was left with a huge list of if/else statements to work out how the user separated the information (space, comma, slash, or hyphen).
Any feedback is appreciated.
Thanks
It's easy enough. The ZIP code is always 5 digits, so a simple regex matching /\d{5}/ will work just fine. The Age is a number from 1 to 3 digits, so /\d{1,3}/ takes care of that. As for the gender, you could just look for an f for female and if there isn't one assume male.
With all that said, what's wrong with separate input fields?
You might want to use a few regular expressions:
One that looks for 5 numeric digits: (?<!\d)\d{5}(?!\d)
One that looks for 2 numeric digits: (?<!\d)\d{2}(?!\d)
One that looks for a single letter: [a-zA-Z]
[EDIT]
I've edited the RegExes. They now match every one of the presented alternatives, and don't require any alteration of the input string (which makes it a more efficient choice). They can also be run in any order.
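A sketch that combines the ideas from both answers into one function. It uses lookarounds so that a number at the very start or end of the input still matches and so that the age pattern cannot match part of the ZIP code. The function name parseUserInfo and the returned keys are assumptions.

```php
<?php
// Hypothetical helper: extracts age, sex and ZIP code regardless of order or separator.
function parseUserInfo(string $input): array
{
    $age = $sex = $zip = null;

    // ZIP: exactly 5 digits, not part of a longer digit run.
    if (preg_match('/(?<!\d)\d{5}(?!\d)/', $input, $m)) {
        $zip = $m[0];
    }
    // Age: 1 to 3 digits, not part of a longer digit run (so never inside the ZIP).
    if (preg_match('/(?<!\d)\d{1,3}(?!\d)/', $input, $m)) {
        $age = (int) $m[0];
    }
    // Sex: m, f, male or female as a standalone word, case-insensitive.
    if (preg_match('/(?<![a-z])(m|f|male|female)(?![a-z])/i', $input, $m)) {
        $sex = strtoupper($m[1][0]);
    }

    return ['age' => $age, 'sex' => $sex, 'zip' => $zip];
}

$samples = ['30 Male 90250', '30/M/90250', 'M 30 90250', '30-M-90250', '90250,M,30'];
foreach ($samples as $sample) {
    print_r(parseUserInfo($sample));
}
```

Any field that cannot be found comes back as null, which gives the caller a chance to ask the user again instead of guessing.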

Tricky file parsing. Inconsistent Delimiters

I need to parse a file with the following format.
0000000 ...ISBN.. ..Author.. ..Title.. ..Edit.. ..Year.. ..Pub.. ..Comments.. NrtlExt Nrtl Next Navg NQoH UrtlExt Urtl Uext Uavg UQoH ABS NEB MBS FOL
ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM 0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS
The ID and ISBN are not a problem; the title is. There is no set length for these fields, and there are no solid delimiters: the space is used as the separator for most of the file.
Another issue is that there is not always an entry in the comments field. When there is, there are spaces within the content.
So I can get the first two fields and the last fourteen. I need some help figuring out how to parse the middle six fields.
This file was generated by an older program that I cannot change. I am using php to parse this file.
I would also ask myself 'How good does this have to be' and 'How many records are there'?
If, for example, you are parsing this list to put up a catalog of books to sell on a website, you probably want to be as good as you can, but expect that you will miss some titles, and build in a feedback mechanism so your users can help you fix the issue (and make it easy for you to fix it in your new format).
On the other hand, if you absolutely have to get it right because you will lose lots of money for each wrong parse, and there are only a few thousand books, you should plan on getting close and then doing a human review of the entire file.
(In my first job, we spent six weeks on a data conversion project to convert 150 records, which was not a good use of time.)
Find the title and publisher of the book by ISBN (in some on-line database) and parse only the rest :)
BTW, are you sure that what looks like a space actually is a space? There are more "invisible" characters (like the non-breaking space). I know, not a good idea, but apparently the author of that format was pretty creative...
You need to analyze your data by hand and find out what year, edition and publisher look like. For example, if you find that the year is always two digits and the publisher always comes from some limited list, this is something you can start with.
While I don't see any way other than guessing a bit, I'd go about it something like this:
I'd strip off what I know I can parse out reliably, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM
From there I'd try to locate the edition and split the string in two at that position, after storing and removing the edition, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) & 64 RANDOM. Another option is to try with the year, but of course titles such as 1984 might present a problem. (Guessing the edition of course assumes it's written as 7th, 51st etc. for all editions.)
Finally, I'd assume I could somewhat reliably guess the year 64 at the start of the second string and so further limit the Publisher(/Comments) part.
The rest is pure guesswork unless you have a list of authors/publishers somewhere to match against, as I'd assume there are not only comments with spaces but also publishers with spaces in their names. But at least you should be down to 2 strings, containing Author/Title in one and Publisher(/Comments) in the other.
All in all it should limit the manual part a bit.
Once done I'd also save it in a better format somewhere so I don't have to go about parsing it again ;)
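The edition-based split described above could look roughly like this in PHP. It is a sketch that assumes the edition is always an ordinal such as 1st or 51st followed by a two-digit year; the function name splitMiddle and the array keys are made up for illustration.

```php
<?php
// Hypothetical helper: splits the unparsed middle of a record at the edition token.
function splitMiddle(string $middle): ?array
{
    // Greedy (.*) means the LAST "ordinal + two-digit year" pair is used as the split point.
    $pattern = '/^(.*)\s(\d+(?:st|nd|rd|th))\s+(\d{2})\s+(.*)$/';
    if (!preg_match($pattern, $middle, $m)) {
        return null; // No recognisable edition: leave the record for manual review.
    }

    return [
        'author_title'       => trim($m[1]),
        'edition'            => $m[2],
        'year'               => $m[3],
        'publisher_comments' => trim($m[4]),
    ];
}

$parts = splitMiddle("ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM");
print_r($parts);
```

Records for which the function returns null are the ones worth collecting in a separate list for the manual pass.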
I don't know if the pcre engine allows multiple groups from within selection, therefore:
([A-Z0-9]{7})\ (\d-\d{3}-\d{5}-\d)\
(.+)\ (\d+(?:st|nd|rd|th))\ \d{2}\
([^\d.]+)\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d{1})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d)\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\w{3})
It does look quite ugly and doesn't fix your author-title problem, but it matches quite well for the rest of it.
Concerning your problem, I don't see any solution other than having a lookup table for authors or using other services to look up title and author via the ISBN.
That is, if the authors are not just represented by their first name, as they are in your example above.
Also double-check all exceptions that might occur with the above regex, as titles may contain 1st or the like.
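To try a whole-line pattern of this kind against the sample record, something like the following can be used. It is a variant written for illustration rather than the exact pattern above: the dots are escaped, the year is captured, and the fourteen trailing numeric columns are assembled from two small sub-patterns.

```php
<?php
$line = "ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM "
      . "0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS";

$num = '(\d+\.\d{2})'; // a price-like column such as 13.90
$int = '(\d+)';        // a quantity column such as 0 or 2

// Column layout taken from the sample: 4 prices, 1 quantity, 4 prices, 1 quantity, 4 prices.
$tail = array_merge(
    array_fill(0, 4, $num), [$int],
    array_fill(0, 4, $num), [$int],
    array_fill(0, 4, $num)
);

$pattern = '/^([A-Z0-9]{7}) (\d-\d{3}-\d{5}-\d) (.+) (\d+(?:st|nd|rd|th)) (\d{2}) ([^\d.]+) '
         . implode(' ', $tail)
         . ' (\w{3})$/';

if (preg_match($pattern, $line, $m)) {
    // $m[3] still holds author and title together; $m[6] holds publisher plus any comments.
    print_r($m);
}
```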
