Any Elegant Ideas on how to parse this Dataset? - php

I'm using PHP 5.3 to receive a Dataset from a web service call that brings back information on one or many transactions. Each transaction's return values are delimited by a pipe (|), and beginning/ending of a transaction is delimited by a space.
2109695|49658|25446|4|NSF|2010-11-24 13:34:00Z 2110314|45276|26311|4|NSF|2010-11-24 13:34:00Z 2110311|52117|26308|4|NSF|2010-11-24 13:34:00Z (etc)
Doing a simple split on space doesn't work because of the space in the datetime stamp. I know regex well enough to know that there are always different ways to break this down, so I thought getting a few expert opinions would help me come up with the most airtight regex.

If each timestamp is going to have a Z at the end, you can use a positive lookbehind assertion to split on a space only if it's preceded by a Z:
$transaction = preg_split('/(?<=Z) /',$input);
Once you get the transactions, you can split them on | to get the individual parts.
Note that if your data has a Z followed by a space anywhere other than the timestamp, the above logic will fail. To overcome that, you can split on a space only if it's preceded by a timestamp pattern:
$transaction = preg_split('/(?<=\d\d:\d\d:\d\dZ) /',$input);
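Putting the two steps together, a minimal sketch (the variable names and the shortened sample string are only for illustration) splits on the space that follows a timestamp and then explodes each record on the pipe:

```php
<?php
// Shortened sample of the web service response from the question
$input = '2109695|49658|25446|4|NSF|2010-11-24 13:34:00Z '
       . '2110314|45276|26311|4|NSF|2010-11-24 13:34:00Z';

$transactions = array();
// Split only on a space that directly follows "hh:mm:ssZ"
foreach (preg_split('/(?<=\d\d:\d\d:\d\dZ) /', $input) as $record) {
    // Each record becomes an array of its six pipe-delimited fields
    $transactions[] = explode('|', $record);
}

print_r($transactions);
```

The space inside each timestamp survives because it follows the date, not the Z.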

As others have said, if you know for sure that there will be no Z characters anywhere other than in the date, you could just do:
$records = explode('Z', $data);
But if you have them elsewhere, you'll need to do something a bit fancier.
$regex = '#(?<=\d{2}:\d{2}:\d{2}Z)\s#i';
$records = preg_split($regex, $data, -1, PREG_SPLIT_NO_EMPTY);
Basically, that regex looks for the time portion (00:00:00) followed by a Z, then splits on the whitespace character that follows.

Each timestamp is going to have a Z at the end, so explode it by 'Z '. You don't need a regular expression. Within a timestamp, a Z can only follow the time, never the date.

Use explode('|', $data) function

Related

Selecting thousands separator character with RegEx

I need to change the decimal separator in a given string that has numbers in it.
What RegEx code can ONLY select the thousands separator character in the string?
It needs to select it only when there are numbers around it. For example, only in 123,456 do I need to select and replace the ,
I'm converting English numbers into Persian (e.g. Hello 123 becomes Hello ۱۲۳). Now I need to replace the decimal separator with the Persian version too, but I don't know how I can select it with regex. E.g. Hello 121,534 must become Hello ۱۲۱/۵۳۴
The character that needs to be replaced is , with /
Use a regular expression with lookarounds.
$new_string = preg_replace('/(?<=\d),(?=\d)/', '/', $string);
(?<=\d) means there has to be a digit before the comma, (?=\d) means there has to be a digit after it. But since these are lookarounds, they're not included in the match, so they don't get replaced.
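As a quick illustration of the lookarounds at work (the sample sentence is made up), commas that are not surrounded by digits are left alone, while every comma between two digits is replaced:

```php
<?php
$string = 'Hello, world: 121,534 and 1,234,567';

// The comma is matched only when a digit precedes and follows it;
// the digits themselves are not part of the match.
$new_string = preg_replace('/(?<=\d),(?=\d)/', '/', $string);

echo $new_string, "\n"; // Hello, world: 121/534 and 1/234/567
```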
According to your question, the main problem you face is to convert the English number into the Persian.
In PHP there is a library available that can format and parse numbers according to the locale, you can find it in the class NumberFormatter which makes use of the Unicode Common Locale Data Repository (CLDR) to handle - in the end - all languages known to the world.
So converting the number 123,456 from en_GB (or en_US) to fa_IR is shown in this little example:
$string = '123,456';
$float = (new NumberFormatter('en_GB', NumberFormatter::DECIMAL))->parse($string);
var_dump(
(new NumberFormatter('fa_IR', NumberFormatter::DECIMAL))->format($float)
);
Output:
string(14) "۱۲۳٬۴۵۶"
(play with it on 3v4l.org)
Now this shows (somehow) how to convert the number. I'm not so firm with Persian, so please excuse me if I used the wrong locale here. There might also be options to tell which character to use for grouping, but for the moment the example just shows that converting the numbers is taken care of by existing libraries. You don't need to re-invent this (which is even a sort of misnomer: this isn't anything a single person could do, or at least it would be sort of insane to do it alone).
With the number conversion clarified, the question remains how to apply it to the whole text. Well, why not locate all the potential numbers in the text, try to parse each match, and if successful (and only if successful) convert it to the other locale?
Luckily the NumberFormatter::parse() method returns false if parsing fails (there is even more error reporting in case you're interested in more details), so this is workable.
For regular expression matching it only needs a pattern which matches a number (largest match wins) and the replacement can be done by callback. In the following example the translation is done verbose so the actual parsing and formatting is more visible:
# some text
$buffer = <<<TEXT
it need to only select , when there is number around it. for example only
when 123,456 i need to select and replace "," I'm converting English
numbers into Persian (e.g: "Hello 123" becomes "Hello ۱۲۳"). now I need to
replace the Decimal separator with Persian version too. but I don't know how
I can select it with regex. e.g: "Hello 121,534" most become
"Hello ۱۲۱/۵۳۴" The character that needs to be replaced is , with /
TEXT;
# prepare formatters
$inFormat = new NumberFormatter('en_GB', NumberFormatter::DECIMAL);
$outFormat = new NumberFormatter('fa_IR', NumberFormatter::DECIMAL);
$bufferWithFarsiNumbers = preg_replace_callback(
'(\b[1-9]\d{0,2}(?:[ ,.]\d{3})*\b)u',
function (array $matches) use ($inFormat, $outFormat) {
[$number] = $matches;
$result = $inFormat->parse($number);
if (false === $result) {
return $number;
}
return sprintf("< %s (%.4f) = %s >", $number, $result, $outFormat->format($result));
},
$buffer
);
echo $bufferWithFarsiNumbers;
Output:
it need to only select , when there is number around it. for example only
when < 123,456 (123456.0000) = ۱۲۳٬۴۵۶ > i need to select and replace "," I'm converting English
numbers into Persian (e.g: "Hello < 123 (123.0000) = ۱۲۳ >" becomes "Hello ۱۲۳"). now I need to
replace the Decimal separator with Persian version too. but I don't know how
I can select it with regex. e.g: "Hello < 121,534 (121534.0000) = ۱۲۱٬۵۳۴ >" most become
"Hello ۱۲۱/۵۳۴" The character that needs to be replaced is , with /
Here the magic is just to bring the string parts into action with the number conversion by making use of preg_replace_callback with a regular expression pattern which should match the needs in your question but is relatively easy to refine, as you define the whole number part and false positives are filtered thanks to the NumberFormatter class:
(\b[1-9]\d{0,2}(?:[ ,.]\d{3})*\b)u
\b ... \b : word boundary at both ends of the number
[ ,.] : grouping character
u : pattern for Unicode UTF-8 strings
(play with it on regex101.com)
Edit:
To only match the same grouping character over multiple thousand blocks, a named group can be created and referenced back to with a named backreference for the repetition:
(\b[1-9]\d{0,2}(?:(?<grouping_char>[ ,.])\d{3}(?:\k<grouping_char>\d{3})*)?\b)u
(now this gets less easy to read; get it deciphered and play with it on regex101.com)
To finalize the answer, only the return clause needs to be condensed to return $outFormat->format($result); and the $outFormat NumberFormatter might need some more configuration but as it is available in the closure, this can be done when it is created.
(play with it on 3v4l.org)
I hope this is helpful and opens up a broader picture: don't look for a solution only at the spot where you hit a wall. Regex alone most often is not the answer. I'm pretty sure there are regex freaks who can give you a one-liner that is pretty stable, but the context in which it is used will not be very stable. That is not to say there is only one answer. Instead, bringing together different levels of work (divide and conquer) lets you rely on a stable number conversion even while you are still unsure how to write a regex pattern for an English number.
You can write a regex to capture numbers with thousand separator, and then aggregate the two numeric parts with the separator you want :
$text = "Hello, world, 121,534" ;
$pattern = "/([0-9]{1,3}),([0-9]{3})/" ;
$new_text = preg_replace($pattern, "$1X$2", $text); // replace the comma with 'X', keep the digit groups intact.
echo $new_text ; // Hello, world, 121X534
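One thing worth checking with the capture-group pattern is a number with more than one separator. Because the digits are consumed by the match, the second comma no longer has a preceding digit left to match against. The sketch below (sample text is made up) compares it with a lookaround variant, which consumes only the comma:

```php
<?php
$text = 'Total: 1,234,567';

// Capture groups consume the digits on both sides of the comma
$withGroups = preg_replace('/([0-9]{1,3}),([0-9]{3})/', '$1X$2', $text);

// Lookarounds only assert the digits, so adjacent separators all match
$withLookarounds = preg_replace('/(?<=[0-9]),(?=[0-9]{3})/', 'X', $text);

echo $withGroups, "\n";      // Total: 1X234,567
echo $withLookarounds, "\n"; // Total: 1X234X567
```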
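In PHP you can do that using str_replace, as long as the string contains no other commas (it replaces every comma, not only those between digits):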
$a="Hello 123,456";
echo str_replace(",", "X", $a);
This will return: Hello 123X456

php preg_replace remove thousand separator in a string

I have some long articles. I want to remove only the thousands separators, not ordinary commas.
$str = "Last month's income is 1,022 yuan, not too bad.";
//=>Last month's income is 1022 yuan, not too bad.
preg_replace('#(\d)\,(\d)#i','???',$str);
How to write the regex patterns? Thanks
If the simplified rule "Match any comma that lies directly between digits" is good enough for you, then
preg_replace('/(?<=\d),(?=\d)/','',$str);
should do.
You could improve it by making sure that exactly three digits follow:
preg_replace('/(?<=\d),(?=\d{3}\b)/','',$str);
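To see the difference between the two patterns, here is a small sketch on a made-up sentence that contains a thousands separator, an ordinary comma, and a comma-separated list of single digits:

```php
<?php
$str = "Last month's income is 1,022 yuan, not too bad. "
     . "Seats 1,2,3 are taken; revenue was 1,234,567.";

// Simplified rule: any comma directly between two digits
$simple = preg_replace('/(?<=\d),(?=\d)/', '', $str);

// Stricter rule: the comma must be followed by exactly three digits
$strict = preg_replace('/(?<=\d),(?=\d{3}\b)/', '', $str);

echo $simple, "\n"; // "Seats 123 are taken" - the list is damaged
echo $strict, "\n"; // "Seats 1,2,3 are taken" - the list survives
```

Both versions turn 1,022 into 1022 and 1,234,567 into 1234567, and both leave the comma after "yuan" alone.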
If you have a look at the preg_replace documentation you can see that you can write captures back in the replacement string using $n:
preg_replace('#(\d),(\d)#','$1$2',$str);
Note that there is no need to escape the comma, or to use i (as there are no letters in the pattern).
An alternative (and probably more efficient) way is to use lookarounds. These are not included in the match, so they don't have to be written back:
preg_replace('#(?<=\d),(?=\d)#','',$str);
The first (\d) is represented by $1, the second (\d) by $2. Therefore the solution is to use something like this:
preg_replace('#(\d)\,(\d)#','$1$2',$str);
Actually it would be better to require 3 digits after the comma to avoid causing havoc in lists of numbers:
preg_replace('#(\d)\,(\d{3})#','$1$2',$str);

PHP Regex Check if two strings share two common characters

I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.
Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).
Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.
This is a regex that defines "valid" lines:
"/^[AB][CD][EF][GH]$/m"
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.
The below example assumes the following:
$line is always a valid format
BigFileOfLines.txt contains only valid lines
Example:
// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
$regex = "magic regex I'm looking for here";
$matchingLines = array();
preg_match_all($regex, $subject, $matchingLines);
return $matchingLines;
}
// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);
/*
* Desired return value (Note: this is an example set, there
* could be more or less than this)
*
* BCEG
* ADFG
* BCFG
* BDFG
*/
One way I know that will work is to have a regex like the following (the following regex would only work for "ACFG"):
"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"
This works alright, and performance is acceptable. What bothers me about it, though, is that I have to generate it based on $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terribly well if the code is later modified to match, say, 3 or more characters, or if the size of each line grows from 4 to 16.
It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.
Thanks in advance!
Update:
It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."
I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:
A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).
Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)
Update 2:
Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.
The reasons I favor this answer:
The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.
However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.
Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?
For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";
This is a regex that defines "valid" lines:
/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m
In english, each line has either A or B at position 0, either C or D
at position 1, either E or F at position 2, and either G or H at
position 3. I can assume that each line will be exactly 4 characters
long.
That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.
Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:
/^[AB][CD][EF][GH]$/
or, alternatively:
/^(A|B)(C|D)(E|F)(G|H)$/
That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:
/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/
But the character-class version is by far the usual way of writing this.
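A quick check of the point about the pipe, as a minimal sketch (the variable names are only for illustration): inside a character class the | is an ordinary character, so [A|B] accepts a literal pipe while [AB] and (?:A|B) do not.

```php
<?php
// Inside [...] the pipe is a literal character
$classWithPipe    = preg_match('/^[A|B]$/', '|');   // 1
$classWithoutPipe = preg_match('/^[AB]$/', '|');    // 0

// Outside a class, | separates alternatives
$alternation      = preg_match('/^(?:A|B)$/', '|'); // 0
$alternationB     = preg_match('/^(?:A|B)$/', 'B'); // 1

var_dump($classWithPipe, $classWithoutPipe, $alternation, $alternationB);
```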
As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
I would rewrite your "ACFG" regex thus: /^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/, but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C..|.F.|..G)|.C(?:F.|.G)|..FG)$/ - but that's still the same solution, just in a more efficiently-processed form.)
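For comparison, the character-by-character approach mentioned above could look roughly like this. It is only a sketch: sharedPositions() is a hypothetical helper name, and the sample lines stand in for the contents of the file.

```php
<?php
// Hypothetical helper: counts positions at which both strings
// hold the same character.
function sharedPositions($a, $b) {
    $shared = 0;
    $length = min(strlen($a), strlen($b));
    for ($i = 0; $i < $length; $i++) {
        if ($a[$i] === $b[$i]) {
            $shared++;
        }
    }
    return $shared;
}

$lines = array('BCEG', 'ADFG', 'BDEH', 'BCFG');

// Keep the lines that share at least 2 positions with "ACFG"
$matching = array_values(array_filter($lines, function ($candidate) {
    return sharedPositions('ACFG', $candidate) >= 2;
}));

print_r($matching); // BCEG, ADFG, BCFG
```

The threshold and the line length are plain parameters here, so going from "2 of 4" to "3 of 16" needs no new pattern.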
You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:
function findMatchingLines($line, $required = 2) {
    static $file = null;
    if ($file === null) {
        // Load the file once, without trailing newlines or blank lines
        $file = file("BigFileOfLines.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    }
    $search = str_split($line);
    $matching = array();
    foreach ($file as $l) {
        // Each position has its own pair of letters, so a shared letter
        // is always a shared position
        if (count(array_intersect($search, str_split($l))) >= $required) {
            $matching[] = $l;
        }
    }
    return $matching;
}
There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).
So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:
"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"
This, of course, is the conclusion you're already at--so good so far.
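Since the alternation has to be generated from the input anyway, the generation can be automated for any line length and any required number of shared positions. The following is a sketch under that assumption; buildSharedPositionsRegex() is a hypothetical name, and the bitmask loop is just one simple way to enumerate the position combinations.

```php
<?php
// Hypothetical helper: builds one alternative per way of keeping exactly
// $k positions of $line, with "." in every other position.
function buildSharedPositionsRegex($line, $k) {
    $n = strlen($line);
    $alternatives = array();
    // Every bitmask with exactly $k bits set selects the kept positions
    for ($mask = 0; $mask < (1 << $n); $mask++) {
        $kept = 0;
        $alternative = '';
        for ($i = 0; $i < $n; $i++) {
            if ($mask & (1 << $i)) {
                $alternative .= preg_quote($line[$i], '/');
                $kept++;
            } else {
                $alternative .= '.';
            }
        }
        if ($kept === $k) {
            $alternatives[] = $alternative;
        }
    }
    return '/^(?:' . implode('|', $alternatives) . ')$/m';
}

$regex = buildSharedPositionsRegex('ACFG', 2);
preg_match_all($regex, "BCEG\nADFG\nBDEH\nBCFG", $matches);

echo $regex, "\n";
print_r($matches[0]); // BCEG, ADFG, BCFG
```

A line that shares more than $k positions still matches, because the remaining positions are wildcards. The loop visits 2^n masks, which is fine for 4 or 16 characters but would need a smarter combination generator for much longer lines.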
The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.
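function findMatchingLines($line, $subject) {
    // The reference line is appended as the last line; the lookahead
    // captures its four characters so every earlier line can be
    // compared against them through backreferences.
    $regex = '/^(?=(?s:.*)\n([AB])([CD])([EF])([GH])\z)'
           . '(?:\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)$/m';
    $matchingLines = array();
    preg_match_all($regex, rtrim($subject) . "\n" . $line, $matchingLines);
    return $matchingLines[0];
}
What this function does is append the line you want to match against to your input string, then use a pattern whose lookahead reads the four characters of that last line and compares each earlier line back to them. PCRE does not accept an unbounded lookbehind, which is why the reference line goes at the end and is read with a lookahead rather than being prepended and read with a lookbehind. Note that the lookahead scans to the end of the input for every line, so this is slow on a big file.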
If you also want to validate those matching lines against the "rules", just replace the . in each pattern to the appropriate character class (\1\2[EF][GH], etc.).
People may be confused by your first regex. You give:
"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"
And then say:
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
But that's not what that regex means at all.
This is because the | operator has the lowest precedence here, so it splits the whole pattern into four alternatives. What that regex really says, in English, is: A or | or B at the start of a line, OR C or | or D anywhere, OR E or | or F anywhere, OR G or | or H at the end of a line.
This is because [A|B] means a character class with one of the three given characters (including the |), because {1} means one character (it is also completely superfluous and could be dropped), and because the outer |'s alternate between everything around them, anchors included. In my English expression above each capitalized OR stands for one of your alternating |'s.
To get your English description as a regex, you would want:
/^[AB][CD][EF][GH]$/
The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.
--
EDIT:
You want to test for only two of these four characters matching.
Very strictly speaking, and picking up from @Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:
/^(?:A(?:C..|.F.|..G)|.C(?:F.|.G)|..FG)$/
as compared to:
/^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/
This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (of which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test for FG in the 3rd and 4th positions.
This regex is specifically created to fail as fast as possible. Listing each case out separately means that, to fail, you would have to test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being in the first position, you would immediately go on to test the 2nd position, without hitting it two more times. Etc.
(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)
--
EDIT: One additional point. Fastest regex is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.
Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:
$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match
$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
// error checking here if necessary - $line and $match must be same length
return (levenshtein($line, $match) <= (strlen($line) - $common));
});
var_dump($matchingLines);
I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:
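/^[^ACFG\n]*+(?:[ACFG][^ACFG\n]*+){2,}$/m
It looks for two or more occurrences of one of the ACFG characters surrounded by any other characters; the \n in the negated classes keeps a match from running across line breaks. The loop is unrolled and uses possessive quantifiers, to improve performance a bit.
Can be generated using:
function getRegexMatchingNCharactersOfLine($line, $num) {
    // Built with sprintf because "{$num}" inside double quotes would be
    // interpolated without its braces; {n,} means "n or more".
    return sprintf('/^[^%1$s\n]*+(?:[%1$s][^%1$s\n]*+){%2$d,}$/m', $line, $num);
}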

I need a PHP regular expression to validate string format of 5 digits, one comma

I have a huge PHP input box on a webpage. This input should only take 5 digit string separated by commas:
00100,00247,90277,97030,00657
notice the last one has no comma at the end.
Is there a regular expression that can do this? Since the input box is very large and can take 100+ of these items, I want to validate it on the PHP server side before the database is queried, and thus avoid any SQL injection attempts.
The query should only run if the input consists of 5-digit numbers, each followed by a comma except for the last one.
These are a state's public water system ID's by the way.
I believe this will get the result you're looking for, though explode may be the better option.
/^(?:\d{5},)*\d{5}$/
This will only match 1 or more 5-digit numbers that are comma delimited with no spaces.
Since this is user submitted data, your validation should be more flexible. What if the user accidentally puts a space after one of the commas? Or a line break gets inserted?
I realize you are looking for a regex solution but may I suggest using explode to create an array and apply a rule to each element. Having them separated into elements allows more flexibility when validating and storing:
$nums = explode(',', '00100,00247,90277,97030,00657');
foreach ($nums as $num) {
    if (!preg_match('/^\d{5}$/', trim($num))) {
        // error!
    }
}
I'd explode it and validate each string individually:
$input = '00100,00247,90277,97030,00657';
$input_array = explode(',', $input);
$is_valid = true;
foreach ($input_array as $number) {
    // Each trimmed entry must consist of exactly 5 digits
    if (!preg_match('/^\d{5}$/', trim($number))) {
        $is_valid = false;
    }
}
var_dump($is_valid);
I think you rather need str_getcsv:
$row = str_getcsv($input);
// $row is an array containing your digits
Simple. This regex matches a value having one or more comma separated 5-digit numbers:
if (preg_match('/^\d{5}(\s*,\s*\d{5})*$/', $value)) {
// Good value
}
It allows whitespace between the numbers as well.
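If the whitespace-tolerant pattern is used for validation, the accepted value still has to be split into clean IDs afterwards. A minimal sketch, assuming a hypothetical helper named parseSystemIds() that returns false for invalid input:

```php
<?php
// Hypothetical helper: validates the whole value first, then splits it.
function parseSystemIds($value) {
    $value = trim($value);
    if (!preg_match('/^\d{5}(\s*,\s*\d{5})*$/', $value)) {
        return false; // reject anything but comma-separated 5-digit IDs
    }
    // Split on the comma together with any whitespace around it
    return preg_split('/\s*,\s*/', $value);
}

var_dump(parseSystemIds('00100, 00247 ,90277')); // three clean IDs
var_dump(parseSystemIds('00100,0024'));          // false
var_dump(parseSystemIds('00100,'));              // false
```

The IDs are kept as strings so that leading zeros are not lost.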
This might work:
/^\d{5}(?:,\d{5})*$/
edit 1 noticed ridgerunner has the same answer, so disregard this.
edit 2 some notes on performance.
Failure analysis
Backtracking give back on failure:
^\d{5}(?:,\d{5})*$ gives back ,\d{5}
^(?:\d{5},)*\d{5}$ gives back \d{5},
Post Backtracking regressive topography checks:
(After backtracking give back, checks are to the right of the one that gave back)
^\d{5}(?:,\d{5})*$ checks for $
^(?:\d{5},)*\d{5}$ checks for \d{5}$
Winner: ^\d{5}(?:,\d{5})*$
NON-Backtracking regexes (using the possessive quantifier *+):
^\d{5}(?:,\d{5})*+$ gives nothing back, fails immediately
^(?:\d{5},)*+\d{5}$ gives nothing back fails immediately
Benchmarks
Using a string of 50 blocks of \d{5},.
The sample string is matched against each regex in a loop of 100,000 times.
Failure was induced at the end of the string, removed for a success test.
Success:
All took 1 second to complete a successful run.
Failure, Backtracking:
^\d{5}(?:,\d{5})*$ took 1.2 seconds best
^(?:\d{5},)*\d{5}$ took 1.6 seconds
Failure, Non-Backtracking:
^\d{5}(?:,\d{5})*+$ took .9 seconds
^(?:\d{5},)*+\d{5}$ took .9 seconds
Conclusions
Backtracking - Put the smallest post-backtracking check
after the backtracking sub-expression. In this case, the
smallest is $.
In general, put the required expressions ahead of the optional ones.
Best ^\d{5}(?:,\d{5})*$
NON-Backtracking - It doesn't matter.
^\d{5}(?:,\d{5})*+$ or ^(?:\d{5},)*+\d{5}$
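A benchmark along these lines can be reproduced with a small harness such as the sketch below. It is not the original benchmark code: the iteration count is lowered to keep the run short, the failing subject is an assumption about what "failure induced at the end of the string" looked like, and the timings will differ by machine and PCRE version.

```php
<?php
// 50 valid blocks followed by a block that breaks at the very end
$failing = str_repeat('12345,', 50) . '1234x';

$patterns = array(
    '/^\d{5}(?:,\d{5})*$/',
    '/^(?:\d{5},)*\d{5}$/',
    '/^\d{5}(?:,\d{5})*+$/',
    '/^(?:\d{5},)*+\d{5}$/',
);

$iterations = 10000; // assumption: lower than the 100,000 used above
$results = array();
foreach ($patterns as $pattern) {
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $matched = preg_match($pattern, $failing);
    }
    // Store the last match result and the elapsed wall-clock time
    $results[$pattern] = array($matched, microtime(true) - $start);
}

foreach ($results as $pattern => $result) {
    printf("%-26s matched=%d %.4fs\n", $pattern, $result[0], $result[1]);
}
```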

Filter array of numeric PIN code strings which may be in the format "######" or "### ###"

I have a PHP array of strings. The strings are supposed to represent PIN codes which are of 6 digits like:
560095
Having a space after the first 3 digits is also considered valid e.g. 560 095.
Not all array elements are valid. I want to filter out all invalid PIN codes.
Yes you can make use of regex for this.
PHP has a function called preg_grep to which you pass your regular expression and it returns a new array with entries from the input array that match the pattern.
$new_array = preg_grep('/^\d{3} ?\d{3}$/',$array);
Explanation of the regex:
^ - Start anchor
\d{3} - 3 digits. Same as [0-9][0-9][0-9]
? - optional space (there is a space before ?)
If you want to allow any number of any whitespace between the groups
you can use \s* instead
\d{3} - 3 digits
$ - End anchor
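A short sketch of preg_grep with this pattern on a made-up array. preg_grep keeps the original keys, so array_values is used here to reindex the result:

```php
<?php
$array = array('560095', '560 095', '56009', '560  095', 'abc123', '5600955');

// Keep only entries of the form "######" or "### ###"
$matched = preg_grep('/^\d{3} ?\d{3}$/', $array);

// Keys of the input array are preserved; reindex if that matters
$new_array = array_values($matched);

print_r($new_array); // 560095, 560 095
```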
Yes, you can use a regular expression to make sure there are 6 digits with or without a space.
A neat tool for playing with regular expressions is RegExr... here's what RegEx I came up with:
^[0-9]{3}\s?[0-9]{3}$
It matches the beginning of the string ^, then any three numbers [0-9]{3} followed by an optional space \s? followed by another three numbers [0-9]{3}, followed by the end of the string $.
Passing the array into the PHP function preg_grep along with the regex will return a new array with only the matching entries.
If you just want to iterate over the valid responses (loop over them), you could always use a RegexIterator:
$regex = '/^\d{3}\s?\d{3}$/';
$it = new RegexIterator(new ArrayIterator($array), $regex);
foreach ($it as $valid) {
//Only matching items will be looped over, non-matching will be skipped
}
It has the benefit of not copying the entire array (it computes the next one when you want it). So it's much more memory efficient than doing something with preg_grep for large arrays. But it also will be slower if you iterate multiple times (but for a single iteration it should be faster due to the memory usage).
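A self-contained run of the iterator approach, on a made-up array with string keys to show that the keys are carried through the loop:

```php
<?php
$array = array('a' => '560095', 'b' => '56 0095', 'c' => '560 095');

$regex = '/^\d{3}\s?\d{3}$/';
$it = new RegexIterator(new ArrayIterator($array), $regex);

$seen = array();
foreach ($it as $key => $valid) {
    // Only entries matching the pattern reach the loop body
    $seen[$key] = $valid;
}

print_r($seen); // a => 560095, c => 560 095
```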
If you want to get an array of the valid PIN codes, use codaddict's answer.
You could also, at the same time as filtering only valid PINs, remove the optional space character so that all PINs become 6 digits by using preg_filter:
$new_array = preg_filter('/^(\d{3}) ?(\d{3})$/D', '$1$2', $array);
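A small sketch of what preg_filter returns for a made-up input. Entries that do not match are dropped, and the D modifier makes $ match only at the very end, so a value with a trailing newline is rejected as well:

```php
<?php
$array = array('560095', '560 095', '12345', "560096\n");

// Matching entries are returned with the space removed;
// non-matching entries are left out of the result entirely.
$new_array = preg_filter('/^(\d{3}) ?(\d{3})$/D', '$1$2', $array);

print_r(array_values($new_array)); // 560095, 560095
```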
The best answer might depend on your situation, but if you wanted to do a simple and low cost check first...
$item = str_replace( " ", "", $var );
if ( strlen( $item ) !== 6 ){
echo 'fail early';
}
Following that, you could equally go on and do some type checking - as long as valid numbers do not start with a 0, in which case it might be more difficult.
If you don't fail early, then go on with the regex solutions already posted.
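One way to sketch the cheap pre-check so that leading zeros are not a problem is to test the characters with ctype_digit instead of converting to a number. looksLikePin() is a hypothetical name, and the check is deliberately loose: it removes every space, so it does not verify where the space sits.

```php
<?php
// Hypothetical pre-check: length first, then digits only.
function looksLikePin($var) {
    $item = str_replace(' ', '', $var);
    if (strlen($item) !== 6) {
        return false; // fail early
    }
    // ctype_digit works on the characters, so "056009" is fine
    return ctype_digit($item);
}

var_dump(looksLikePin('560 095')); // true
var_dump(looksLikePin('056009'));  // true
var_dump(looksLikePin('56009'));   // false
var_dump(looksLikePin('56a095'));  // false
```

Values that pass still need one of the regex checks above to confirm that an optional space appears only after the third digit.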
