Having a hard time with this one as I don't think I know all of my options.
I have to parse a free-form text field and map its values to database columns.
Here is some example text. NOTE: not all fields have to be present, not all delimiters are the same, and not all descriptors are available. I also need to check whether each value is numeric only or alphanumeric.
Example 1
field1: 999-999234-24-2
field2 Description: a short description
field3: 3.222.1
asdfg
field number four: NO
field5:
Example 2
field1: 999-999234-24-2/field2 Description: a short description/field3: 3.222.1 asdfg/field number four: NO/field5:
Example 3
999-999234-24-2
Example 4
field1: 999-999234-24-2 field2 Description: a short description field3: 3.222.1 asdfg field number four: NO field5:
Example 5
field1: 999-999234-24-2 - field2 Description: a short description - field3: 3.222.1 asdfg - field number four: NO - field5:
What I would like is for each field X to be in its own column. NOTE: the example data is all in the same order, but the live data is not.
I don't mind doing this in steps if I need to, but I'm having a hard time just parsing the values into columns. Any suggestions?
I was thinking of some sort of case function with a regex, but no luck so far.
Maybe you should standardize on the Java .properties format; then you can use this PHP example to parse it:
http://www.innerweaver.com/?p=13
Since it's still stuck in my head ... the way I'd go about it is to start handling each of these cases and see whether any tweaks or fallout remain. What makes this tricky is that the only reliable delimiter is 'field', and if anyone uses that word in a description it'll break. I'd just have to take the file and start iterating.
Splitting on this regex would at least be a good starting point for separating the headers from the data. Basically, it matches 'field' plus optional additional text, which covers the possibility of 'Description' and 'number four' appearing before the closing colon:
field[^:]{0,13}:
After that, you'd at least have to strip the trailing / for case #2, the ' - ' for case #5, and the extra line breaks for case #1 if you don't want them in the data.
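A rough sketch of that split-then-clean approach, in PHP since that is what the other answer points to. The function name parseFields is made up for illustration, and the length bound is loosened to 20 here as an assumption, to leave some room for longer labels such as 'field2 Description'. Input with no 'field' label at all (Example 3) comes back as an empty array and would need separate handling.

```php
<?php
// Hypothetical helper: split free-form text into label => value pairs.
function parseFields(string $text): array
{
    // Split on "field...:" labels and keep the labels in the result.
    $parts = preg_split('/(field[^:]{0,20}:)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $fields = [];
    // $parts[0] is whatever precedes the first label; labels sit at odd indexes.
    for ($i = 1; $i < count($parts); $i += 2) {
        $label = rtrim($parts[$i], ':');
        $value = $parts[$i + 1] ?? '';
        // Collapse line breaks and runs of spaces (case #1).
        $value = trim(preg_replace('/\s+/', ' ', $value));
        // Drop a trailing "/" (case #2) or " -" (case #5).
        // Note: this also removes a "-" or "/" that legitimately ends a value.
        $value = rtrim($value, "/- ");
        $fields[$label] = $value;
    }
    return $fields;
}

print_r(parseFields("field1: 999-999234-24-2 - field2 Description: a short description - field5:"));
```

Each value could then be tested with ctype_digit() to tell numeric-only values from alphanumeric ones.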
A regex would be hard to maintain in some edge cases. Try writing a simple finite state machine.
After much thought and trial and error, I'm going to read them into an array and parse out each line of text. It's long and going to be a mess, but it should get the job done.
Related
I had a MySQL table with some user data, which I needed to correct and migrate to a new MySQL table. I exported the table using "Export to PHP Array plugin for PHPMyAdmin" from "Geoffray Warnants" and it returned a (PHP) array.
One of the fields contains a telephone number. Some of the entries were exported as strings, but others have the telephone number represented as an integer. When I try to read such a number, it returns something like:
4.36991052022E+12
when it should be:
4369910520219
I suppose the integer value is too big, so that must be the problem. (that's the reason for the E+12)
I have close to 300 entries, and there is no way I can manually put quotes before and after each number, since I also have a fax field.
Most recently, I tried (with the help of the Sublime Text 2 demo) to cast the number by writing (string) in front of it, but that doesn't work.
I'm kind of helpless now and ask for your help. What can I do?
Please take a look at this question, which should answer yours:
Convert a big integer to a full string in PHP
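The linked question is not reproduced here. As a minimal sketch of one approach that needs no extra library, sprintf() with %.0f prints a number without an exponent. This assumes the value is a whole number below 2^53, so that a float still holds every digit exactly; phoneToString is a made-up helper name.

```php
<?php
// Hypothetical helper: return a phone value as a plain digit string.
// Assumes numeric values are whole numbers below 2^53.
function phoneToString($value): string
{
    if (is_string($value)) {
        return $value; // already exported as a string, leave untouched
    }
    // %.0f formats the number with no exponent and no decimal places.
    return sprintf('%.0f', $value);
}

echo phoneToString(4369910520219), "\n";
```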
Since I didn't have the time to get through the "complicated" process of installing the GMP library, I decided to go old-school and just put double quotes ("") around every phone number value, no matter whether it was a string or a "big integer", and then remove the (single) quotes from the final string.
Thanks to Sublime Text 2!
So I had:
array(..., 'phone'=>' 43 664 1000383', ...);
and
array(..., 'phone'=>4369910520219, ...);
Search for and Find All 'phone'=> and add " after it.
Then search for (in my case) ,'fax'=> and add " before it.
Then, for every string:
$user["phone"] = preg_replace("/'/", "", $user["phone"]);
Thanks though for the library. I might actually use it someday. ;)
Greetings,
Joseph
I'm looking for a solution to convert all numbers in a given range to another number in the same range, and later convert that number back.
More concrete, let's say I have the numbers 1..100.
The easiest way to map every number to another one in the same range is b = 101 - a; the original is later recovered with a = 101 - b.
My problem is that I want to simulate some randomness.
I want to implement this in PHP, but the coding language doesn't matter.
WHY?
You may ask why. Good question :)
I am generating easy-to-read short code strings based on IDs, and because the IDs are incremented one by one, my consecutive short codes are too similar.
Later I need to "decode" the short codes, to get the id.
What my algorithm is doing now is:
0000001 -> ababac, 0000002 -> ababad, 0000003 -> ababaf, etc.
later
ababac -> 0000001, ababad -> 0000002, ababaf -> 0000003, etc.
So before I actually generate the short code I want to "randomize" the number as much as possible.
Option 1:
Why don't you just have a database table for the conversion? I.e. each record has a "real" id and a "random md5" string or something similar.
Option 2:
Use a rainbow table - maybe even a MD5 lookup table for the range 0 - 10,000 or whatever. Then just do a hashtable lookup
Finally I found a solution based on the modulo operator, on the math forum.
The solution can be found here:
https://math.stackexchange.com/questions/259891/function-to-convert-each-number-in-a-m-n-to-another-number-in-the-same-range
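The linked answer is not reproduced here. As a sketch of one common modulo-based construction (not necessarily the one from that answer): multiply by a number that is coprime to the size of the range, add an offset, and take the result modulo the range size. Decoding multiplies by the modular inverse. The constants 37, 73 and 11 and the function names are illustrative assumptions for the range 1..100.

```php
<?php
// Illustrative constants for the range 1..100 (all of them are assumptions).
const RANGE_MIN  = 1;
const RANGE_MAX  = 100;
const MULTIPLIER = 37; // must be coprime to the range size (gcd(37, 100) = 1)
const INVERSE    = 73; // modular inverse: 37 * 73 = 2701, and 2701 % 100 = 1
const OFFSET     = 11;

// Map a number in the range to a scattered number in the same range.
function scramble(int $a): int
{
    $size = RANGE_MAX - RANGE_MIN + 1;
    return (($a - RANGE_MIN) * MULTIPLIER + OFFSET) % $size + RANGE_MIN;
}

// Reverse scramble(): undo the offset, then multiply by the inverse.
function unscramble(int $b): int
{
    $size = RANGE_MAX - RANGE_MIN + 1;
    $shifted = ($b - RANGE_MIN - OFFSET + $size) % $size;
    return ($shifted * INVERSE) % $size + RANGE_MIN;
}

echo scramble(1), ' ', scramble(2), ' ', scramble(3), "\n"; // prints "12 49 86"
```

This only looks random; anyone who sees enough codes can work out the constants, so it is not a substitute for real obfuscation.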
I am having some trouble trying to figure out how to parse information collected from user. The information I am collecting is:
Age
Sex
Zip Code
Following are some examples of how I may receive this from users:
30 Male 90250
30/M/90250
30 M 90250
M 30 90250
30-M-90250
90250,M,30
I started off with the explode function, but I was left with a huge list of if/else statements trying to work out how the user had separated the information (was it a space, a comma, a slash or a hyphen?).
Any feedback is appreciated.
Thanks
It's easy enough. The ZIP code is always 5 digits, so a simple regex matching /\b\d{5}\b/ will work just fine. The age is a number from 1 to 3 digits, so /\b\d{1,3}\b/ takes care of that (the word boundaries keep it from matching part of the ZIP code). As for the gender, you could just look for an f for female and, if there isn't one, assume male.
With all that said, what's wrong with separate input fields?
You might want to use a few regular expressions:
One that looks for 5 numeric digits: (?<!\d)\d{5}(?!\d)
One that looks for 2 numeric digits: (?<!\d)\d{2}(?!\d)
One that looks for a single letter: [a-zA-Z]
[EDIT]
I've edited the RegExes. They now match every one of the presented alternatives, and don't require any alteration of the input string (which makes it a more efficient choice). They can also be run in any order.
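A minimal sketch of how these expressions could be combined in PHP. The function name parseAgeSexZip is made up, and the age pattern is widened to 1 to 3 digits as an assumption. Lookarounds, (?<!\d) and (?!\d), are used here rather than [^\d] so that a number at the very start or end of the input still matches.

```php
<?php
// Hypothetical helper: pull age, sex and ZIP out of a loosely formatted string.
// Any part that cannot be found is returned as null.
function parseAgeSexZip(string $input): array
{
    // Exactly five digits, not touching any other digit.
    preg_match('/(?<!\d)\d{5}(?!\d)/', $input, $zip);
    // One to three digits, not touching any other digit (so never part of the ZIP).
    preg_match('/(?<!\d)\d{1,3}(?!\d)/', $input, $age);
    // First letter found: "M", "F", or the initial of "Male"/"Female".
    preg_match('/[a-zA-Z]/', $input, $sex);

    return [
        'age' => isset($age[0]) ? (int) $age[0] : null,
        'sex' => isset($sex[0]) ? strtoupper($sex[0]) : null,
        'zip' => $zip[0] ?? null,
    ];
}

print_r(parseAgeSexZip('90250,M,30'));
```

The ZIP is kept as a string so that leading zeros survive.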
I have a form in which users can enter prices for items. Ideally I want the user to be able to add prices in whatever method feels best to them and also for readability. I then need to convert this to a standard float so that my web service can calculate costs etc.
The part I'm struggling with is how to take the initial string/float/int of currency and convert it into a float.
For example:
UK: 1,234.00
FRA: 1 234,00
RANDOM: 1234
RANDOM2: 1234.00
All of those have slightly different formats.
Which I would want to store as:
1234.00
I will then store the result in MySQL database as a DECIMAL.
Any help would be great.
Assuming you're using MySQL, DECIMAL or NUMERIC are the correct types for storing currency.
Floats are susceptible to rounding errors and have limited precision.
The formatting for display should be handled by PHP.
If you store the value in a database, you should of course also store a currency code, which can be used on retrieval to tell PHP how to display it.
Couldn't you use:
floatval($AnyVar)
When you want to accept so many different formats, it's a bit tricky to get it right.
A simple regex can capture the full and decimal parts of the value:
/^([0-9,. ]+?)(?:[.,](\d{1,2})$|$)/
The regex will capture the full part of the number + a decimal part, separated with a , or a . and which has one or two numbers.
The capture group 1 will contain the full part, and group 2 the decimal part (if any).
To get your number, you just need to filter out all non-numeric characters from the full part, and join the filtered full and decimal parts together.
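The steps above could be sketched like this in PHP. The function name normalizePrice is made up, and returning a string rather than a float is a design choice for this sketch, so that the value can go into a DECIMAL column without picking up float rounding.

```php
<?php
// Hypothetical helper: turn "1,234.00", "1 234,00", "1234" etc. into "1234.00".
// Returns null when the input does not look like a price.
function normalizePrice(string $input): ?string
{
    if (!preg_match('/^([0-9,. ]+?)(?:[.,](\d{1,2})$|$)/', trim($input), $m)) {
        return null;
    }
    // Group 1: the full part, with thousands separators and spaces removed.
    $whole = preg_replace('/\D/', '', $m[1]);
    if ($whole === '') {
        return null; // only separators, no digits
    }
    // Group 2: the decimal part, if any. Pad "5" to "50"; use "00" when absent.
    $decimal = str_pad($m[2] ?? '', 2, '0');

    return $whole . '.' . $decimal;
}

echo normalizePrice('1 234,00'), "\n"; // prints "1234.00"
```

Note that a separator followed by exactly three digits (as in "1,234" or "1.234") is treated as a thousands separator, never as a decimal point.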
If you want to make it more foolproof, you probably should implement something on the client-side to guide the user to input the value in the correct format.
I need to parse a file with the following format.
0000000 ...ISBN.. ..Author.. ..Title.. ..Edit.. ..Year.. ..Pub.. ..Comments.. NrtlExt Nrtl Next Navg NQoH UrtlExt Urtl Uext Uavg UQoH ABS NEB MBS FOL
ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM 0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS
The ID and ISBN are not a problem, the title is. There is no set length for these fields, and there are no solid delimiters- the space can be used for most of the file.
Another issue is that there is not always an entry in the comments field. When there is, there are spaces within the content.
So I can get the first two, and the last fourteen. I need some help figuring out how to parse the middle six fields.
This file was generated by an older program that I cannot change. I am using php to parse this file.
I would also ask myself 'How good does this have to be' and 'How many records are there'?
If, for example, you are parsing this list to put up a catalog of books to sell on a website, you probably want to be as good as you can, but expect that you will miss some titles and build in a feedback mechanism so your users can help you fix the issues (and make it easy for you to fix them in your new format).
On the other hand, if you absolutely have to get it right because you will lose lots of money for each wrong parse, and there are only a few thousand books, you should plan on getting close and then doing a human review of the entire file.
(In my first job, we spent six weeks on a data conversion project to convert 150 records, which was not a good use of time.)
Find the title and publisher of the book by ISBN (in some on-line database) and parse only the rest :)
BTW. are you sure that what looks like space actually is a space? There are more "invisible" characters (like non-break space). I know, not a good idea, but apparently author of that format was pretty creative...
You need to analyze your data by hand and find out what year, edition and publisher look like. For example, if you find that the year is always two digits and the publisher always comes from some limited list, that is something you can start with.
While I don't see any way other than guessing a bit, I'd go about it something like this:
I'd peel off what I know I can parse out reliably, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM
From there I'd try to locate the edition, store and remove it, and split the string in two at that position, leaving you with ABE WOMAN IN THE DUNES (INT'L ED) & 64 RANDOM. Another option is to try the year, but of course titles such as 1984 might present a problem. (Guessing the edition of course assumes it is written as 7th, 51st, etc. for all editions.)
Finally, I'd assume I could somewhat reliably guess the year 64 at the start of the second string and so further narrow down the Publisher(/Comment) part.
The rest is pure guesswork unless you got a list of authors/publishers somewhere to match against as I'd assume there are not only comments with spaces but also publishers with spaces in their names. But at least you should be down to 2 strings containing Author/Title in one and Publisher(/Comments) in the other.
All in all it should limit the manual part a bit.
Once done I'd also save it in a better format somewhere so I don't have to go about parsing it again ;)
I don't know whether the PCRE engine can capture each repetition of a repeated group separately, so every column is written out explicitly:
([A-Z0-9]{7})\ (\d-\d{3}-\d{5}-\d)\
(.+)\ (\d+(?:st|nd|rd|th))\ \d{2}\
([^\d.]+)\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d{1})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d)\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\w{3})
It does look quite ugly and doesn't fix your author/title problem, but it matches the rest quite well.
As for that problem, I don't see any solution other than a lookup table for authors, or using another service to look up title and author via the ISBN.
That is, if, unlike in your example above, the authors are not represented by just their first name.
Also double-check all the exceptions that might occur with the above regex, as titles may contain 1st or the like.
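A sketch of how a pattern along these lines could be applied in PHP. It is a variant rather than the exact regex above: it uses extended (x) mode so the pattern can span several lines, captures the year, accepts "th" editions, and grabs the fourteen numeric columns as one group that is then split on spaces. The names BOOK_LINE and parseBookLine are made up, and the sample line is the one from the question.

```php
<?php
// In x mode, unescaped whitespace in the pattern is ignored; "\ " is a literal space.
const BOOK_LINE = '/^
    ([A-Z0-9]{7}) \ (\d-\d{3}-\d{5}-\d)
    \ (.+) \ (\d+(?:st|nd|rd|th)) \ (\d{2})
    \ ([^\d.]+?)
    \ ((?:\d+(?:\.\d{2})?\ ){14})
    (\w{3})
/x';

// Hypothetical helper: returns null when the line does not fit the pattern.
function parseBookLine(string $line): ?array
{
    if (!preg_match(BOOK_LINE, $line, $m)) {
        return null;
    }
    return [
        'id'                 => $m[1],
        'isbn'               => $m[2],
        'author_title'       => $m[3], // still needs a lookup to split author and title
        'edition'            => $m[4],
        'year'               => $m[5],
        'publisher_comments' => $m[6],
        'numbers'            => explode(' ', trim($m[7])), // the fourteen numeric columns
        'flag'               => $m[8],
    ];
}

$line = "ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM "
      . "0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS";
print_r(parseBookLine($line));
```

Lines that return null are the ones worth collecting for manual review.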