How to convert this PHP RegEx match pattern to .Net


This is a pretty complex regular expression that returns an array of key/value pairs from a proprietary string of data. Here is a sample of the data, in case the expression cannot be used in .Net and another method needs to be used.
0,"101"1,"12345"11,"ABC Company"12,"John Doe"13,"123 Main St"14,""15,"Malvern"16,"PA"17,"19355"19,"UPS"21,"10"22,"GND"23,""24,"082310"25,""26,"0.00"29,"1Z1235550300000645"30," PA 193 9-05"34,"6.55"37,"6.55"38,"8.05"65,"1Z1235550300000645"77,"10"96,""97,""98
If you look closely you can see it's key,"value",key,"value". The only guarantee on formatting is that each key/value pair is separated by a comma, and each value will always be encased in double quotes. The main problem (the reason you can't explode it) is the previous coder's poor choice to separate keys and values with the same character as the entries. Anyway, it's out of my hands. Here is a working PHP example.
function parseResponse($response) {
    // split response into $key, $value pieces
    preg_match_all("/(.*?),\"(.*?)\"/", $response, $m);
    // loop through pieces and format
    foreach ($m[1] as $index => $key) {
        $value = $m[2][$index];
        echo $key . ":" . $value;
        // this will output KEY:VALUE for each entry in the string
    }
}
You can see the expression /(.*?),\"(.*?)\"/
Here is what I have in VB .Net
Imports System.Text.RegularExpressions
Public Class Parser
Private Sub parseResponse(ByVal response As String)
Dim regExMatch As Match = Regex.Match(response, "/(.*?),\""(.*?)\""/")
End Sub
End Class

You need to remove the PHP delimiters:
Dim RegexObj As New Regex("(.*?),""(.*?)""")
Also, better be more specific about what can be matched (makes the regex much more efficient):
Dim RegexObj As New Regex("([^,]*),""([^""]*)""")
Now the first group only matches characters that aren't commas, and the second one only matches characters that aren't quotes. Both regexes would fail, by the way, if you were to have escaped quotes in your data.
To get all matches in a string, use
AllMatchResults = RegexObj.Matches(response)
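For comparison, here is a minimal PHP sketch of the same tightened pattern applied to a shortened copy of the sample data. The function name parsePairs and the shortened sample are made up for illustration, and the sketch assumes keys are unique (a repeated key would overwrite the earlier value).

```php
<?php
// Hypothetical helper: returns key => value pairs from the proprietary string.
function parsePairs(string $response): array
{
    // Group 1: anything up to the comma; group 2: anything up to the closing quote.
    preg_match_all('/([^,]*),"([^"]*)"/', $response, $m);
    // Pair each captured key with the value captured at the same position.
    return array_combine($m[1], $m[2]);
}

// Shortened sample, including one empty value.
$sample = '0,"101"1,"12345"11,"ABC Company"14,""';
foreach (parsePairs($sample) as $key => $value) {
    echo $key . ':' . $value . "\n";
}
```

As with the .Net version, this does not handle escaped quotes inside a value.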

Related

Checking if an array key contains a value? [duplicate]

I have an array and I want to check whether the keys end with '-main'.
foreach ($data as $key => $value) {
if (substr($key, -5) == '-main'){
....
}
}
If they do have '-main' I then want to get the text prior to '-main'. I do:
$myVar = substr($key, 0, -5);
Is there a more efficient way of splitting the key so I don't have to do two sub strings?
Perhaps I do not want to use '-main' any more and want to use a different length search item, perhaps as a variable. I would then have to do a character count rather than specifying -5. Is there a way to incorporate a variable without having to do character counts?

You can use a regex search (http://php.net/manual/de/function.preg-match.php) to do the task.
if (preg_match('/^(.*)(-main)$/', $key, $hit)) {
$myVar = $hit[1];
// explanation:
// $hit[0] will contain the whole result string
// $hit[1] will contain the part before "-main"
// $hit[2] will be "-main"
}
The regular expression is the following:
/ ... / -> required by preg_match (the pattern delimiters)
^ -> beginning of the string (we want to start at the first position)
(-main) -> the search text we're using
$ -> end of the string (we don't want a -main in the middle of the string to match)
(.*) -> anything else in the string between the start and "-main"; the parentheses mean that this part will be returned as one element of $hit
If you switch the text you want to search for, keep in mind that certain characters have a special meaning in a regex. So you might need to escape them.
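A minimal sketch of that escaping step, assuming the suffix arrives in a variable: preg_quote escapes any regex metacharacters in it, so no character counting is needed. The function name stripSuffix is hypothetical.

```php
<?php
// Hypothetical helper: returns the part of $key before $suffix,
// or null when $key does not end with $suffix.
function stripSuffix(string $key, string $suffix): ?string
{
    // preg_quote escapes special characters; '/' is passed as the delimiter in use.
    $pattern = '/^(.*)' . preg_quote($suffix, '/') . '$/';
    if (preg_match($pattern, $key, $hit)) {
        return $hit[1];
    }
    return null;
}

var_dump(stripSuffix('home-main', '-main'));
var_dump(stripSuffix('home-other', '-main'));
```

Without preg_quote, a suffix such as ".b" would also match "xb", because an unescaped period matches any character.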

Best way to parse this string and create an array from it

I have the following string:
{item1:test},{item2:hi},{another:please work}
What I want to do is turn it into an array that looks like this:
[item1] => test
[item2] => hi
[another] => please work
Here is the code I am currently using for that (which works):
$vf = '{item1:test},{item2:hi},{another:please work}';
$vf = ltrim($vf, '{');
$vf = rtrim($vf, '}');
$vf = explode('},{', $vf);
foreach ($vf as $vk => $vv)
{
$ve = explode(':', $vv);
$vx[$ve[0]] = $ve[1];
}
My concern is: what if the value has a colon in it? For example, let's say that the value for item1 is you:break. That colon is going to make me lose break entirely. What is a better way of coding this in case the value has a colon in it?

Why not set a limit on the explode function? Like this:
$ve = explode(':', $vv, 2);
This way the string will split only at the first occurrence of a colon.
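Dropped into the original loop, the limit could look like the sketch below. The function name parseItems is made up for illustration, and array_pad is an added guard (not in the question) for entries that contain no colon at all.

```php
<?php
// Hypothetical wrapper around the question's loop, using explode() with a limit of 2.
function parseItems(string $vf): array
{
    $vx = [];
    $vf = ltrim($vf, '{');
    $vf = rtrim($vf, '}');
    foreach (explode('},{', $vf) as $vv) {
        // Split at the first colon only; pad so a colon-less entry yields an empty value.
        [$key, $value] = array_pad(explode(':', $vv, 2), 2, '');
        $vx[$key] = $value;
    }
    return $vx;
}

print_r(parseItems('{item1:you:break},{item2:hi},{another:please work}'));
```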

To address the possibility of the values having embedded colons, and for the sake of discussion (not necessarily performance):
$ve = explode(':', $vv);
$key = array_shift($ve);
$vx[$key] = implode(':', $ve);
...grabs the first element of the array, assuming the index will NOT have a colon in it. Then re-joins the rest of the array with colons.

Don't use effing explode for everything.
You can more reliably extract such simple formats with a trivial key:value regex. In particular since you have neat delimiters around them.
And it's far less code:
preg_match_all('/{(\w+):([^}]+)}/', $vf, $match);
$array = array_combine($match[1], $match[2]);
The \w+ just matches an alphanumeric string (letters, digits and underscores), and [^}]+ matches anything up to a closing }. And array_combine more easily turns it into a key=>value array.

Answering your second question:
If your format breaks on specific content, that's bad. I think there are two kinds of workaround.
Escape delimiters: every colon and curly bracket in the data would have to be escaped, which is awkward, so instead the data is delimited with e.g. " and only those quotation marks are escaped (at which point you have JSON).
Save data lengths: this is roughly how PHP serializes arrays. In that data structure you state that the next n chars are one token.
The first type is easy to read and manipulate, although one has to read the whole file to access it randomly.
The second type would be better for random access if the structure stored the number of bytes to skip rather than the number of characters (since in UTF-8 you cannot skip n chars without reading them). PHP's serialize function produces n == strlen($token), so I don't know what the advantage over JSON is.
Where possible I try to use JSON for communication between different systems.
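To make the "save data lengths" idea concrete, here is a small sketch of a made-up length-prefixed format ("<byte length>:<bytes>", repeated). It is not PHP's serialize format; the function names and the format are assumptions for illustration only.

```php
<?php
// Hypothetical format: each token is written as "<byte length>:<bytes>".
function encodeTokens(array $tokens): string
{
    $out = '';
    foreach ($tokens as $token) {
        // strlen() counts bytes, so the prefix says how many bytes to skip.
        $out .= strlen($token) . ':' . $token;
    }
    return $out;
}

function decodeTokens(string $data): array
{
    $tokens = [];
    $pos = 0;
    $len = strlen($data);
    while ($pos < $len) {
        // The first colon after $pos ends the length prefix.
        $colon = strpos($data, ':', $pos);
        if ($colon === false) {
            throw new InvalidArgumentException('Missing length prefix');
        }
        $size = (int) substr($data, $pos, $colon - $pos);
        // Take exactly $size bytes; colons or braces inside the token are irrelevant.
        $tokens[] = substr($data, $colon + 1, $size);
        $pos = $colon + 1 + $size;
    }
    return $tokens;
}

$encoded = encodeTokens(['item1', 'you:break']);
print_r(decodeTokens($encoded));
```

Because the decoder never inspects the token contents, delimiters inside a value need no escaping.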

Delete multiple file for/while

I have a PHP pull-down where I select an item and delete all files associated with it.
It works well when there are only 5 or 6 items. After I put in the first 4 to test and got it working, I realized it could take a very long time to enter a couple hundred, and that would bloat the script.
Not knowing enough about for and while loops, is there anyone who might have a way to help?
There will never be more than one set deleted at a time.
Thanks in advance.
<?php
$workitem = $_POST["workitem"];
$workdirPath = "/var/work.files";
if ($workitem == 'item1.php')
{
    unlink("$workdirPath/page1.php");
    unlink("$workdirPath/temp1.php");
    unlink("$workdirPath/all1.php");
}
if ($workitem == 'item2.php')
{
    unlink("$workdirPath/page2.php");
    unlink("$workdirPath/temp2.php");
    unlink("$workdirPath/all2.php");
}
if ($workitem == 'item3.php')
{
    unlink("$workdirPath/page3.php");
    unlink("$workdirPath/temp3.php");
    unlink("$workdirPath/all3.php");
}
if ($workitem == 'item4.php')
{
    unlink("$workdirPath/page4.php");
    unlink("$workdirPath/temp4.php");
    unlink("$workdirPath/all4.php");
}
?>

Some simple pattern matching and substitution is all you need here.
First, the code:
if (preg_match('/^item(\d+)\.php$/', $workitem, $matches)) {
    $number = $matches[1];
    foreach (array('page', 'temp', 'all') as $base) {
        unlink("$workdirPath/$base$number.php");
    }
} else {
    # unrecognized work item value; complain to user or whatever
}
The preg_match function takes a pattern, a string, and an array. If the string matches the pattern, the parts that match are stored in the array. The particular type of pattern is a Perl 5-compatible regular expression ("p" for Perl, "reg" for regular), which is where the preg_ part of the name comes from.
Regular expressions are scary-looking to the uninitiated, but they're a handy way to scan a string and get some values out of it. Most characters just represent themselves; the string "foo" matches the regular expression /foo/. But some characters have special meanings that let you make more general patterns to match a whole set of strings where you don't have to know ahead of time exactly what's in them.
The /s just mark the beginning and end of the actual regular expression; they're there because you can stick additional modifier flags inside the string along with the expression itself.
The ^ and $ represent the beginning and end of the string. /foo/ matches "foo", but also "foobar", "bunnyfoofoo", and so on - any string that contains "foo" will match. But /^foo$/ matches only "foo" exactly.
\d means "any digit". + means "one or more of that last thing". So \d+ means "one or more digits".
The period (.) is special; it matches any character at all. Since we want a literal period, we have to escape it with a backslash; \. just matches a period.
So our regular expression is '/^item\d+\.php$/', which will match any itemnumber.php filename. But that's not quite enough. The preg_match function is basically a binary test: does the string match the pattern or not, yes or no? In this case, it's not enough to just say "yup, the string is valid"; we need to know which items specifically the user specified. That's what capture groups are for. We use parentheses to say "remember what matched this part", and provide an array name that gets filled with those remembrances.
The part of the string that matches the whole regular expression (which may not be the whole string, if the regular expression isn't anchored with ^...$ like this one is) is always put in element 0 of the array. If you use parentheses in the regular expression, then the part of the string that matches the part of the regular expression inside the first pair of parentheses is stored in element 1 of the array; if there's a second set of parentheses, the matching part of the string goes in element 2 of the array, and so on.
So we put parentheses around our number ((\d+)) and then the actual number will be remembered in element 1 of our $matches array.
Great, we have a number. Now we just need to use it to build up the filenames we want to delete.
In each case, we want to delete three files: page$n.php, temp$n.php, and all$n.php, where $n is the number we extracted above. We could just put three unlink calls, but since they're all so similar, we can use a loop instead.
Take the different prefixes that are the same no matter the number, and make an array out of them. Then loop over that array. In the body of the loop, the variable $base will contain whichever element of the array it's currently on. Stick that between the $workdirPath prefix and the $number we got from the match, append .php, and that's your file. unlink it and go back to the top of the loop to grab the next one.
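To check the pattern and the loop without deleting anything, the same logic can be sketched as a function that only returns the paths it would unlink. The function name filesForWorkItem is made up for illustration.

```php
<?php
// Hypothetical dry-run helper: lists the files the loop above would delete.
function filesForWorkItem(string $workitem, string $workdirPath): array
{
    // Element 1 of $matches holds the digits captured by (\d+).
    if (!preg_match('/^item(\d+)\.php$/', $workitem, $matches)) {
        return []; // unrecognized work item value
    }
    $number = $matches[1];
    $files = [];
    foreach (['page', 'temp', 'all'] as $base) {
        $files[] = "$workdirPath/$base$number.php";
    }
    return $files;
}

print_r(filesForWorkItem('item12.php', '/var/work.files'));
```

Since the value comes from $_POST, the anchored pattern also acts as a filter: anything that is not of the form item<digits>.php yields an empty list.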

PHP Regex to identify keys in array representation

I have this string authors[0][system:id] and I need a regex that returns:
array('authors', '0', 'system:id')
Any ideas?
Thanks.

Just use PHP's preg_split(), which returns an array of elements similarly to explode() but with RegEx.
Split the string on [ or ] and then remove the last element (which is an empty string) of the returned array, $tokens.
EDIT: Also, remove the 3rd element with array_splice($array, $offset, $length), since this item is also an empty string.
The regex /[\[\]]/ just means match any [ or ] character
$string = "authors[0][system:id]";
$tokens = preg_split("/[\]\[]/", $string);
array_pop($tokens);
array_splice($tokens, 2, 1);
//rest of your code using $tokens
Here is the format of $tokens after this has run:
Array ( [0] => authors [1] => 0 [2] => system:id )

Taking the most simplistic approach, we would just match the three individual parts. So first of all we'd look for the token that is not enclosed in brackets:
[a-z]+
Then we'd look for the brackets and the value in between:
\[[^\]]+\]
And then we'd repeat the second step.
You'd also need to add capture groups () to extract the actual values that you want.
So when you put it all together you get something like:
([a-z]+)\[([^\]]+)\]\[([^\]]+)\]
That expression could then be used with preg_match() and the values you want would be extracted into the referenced array passed to the third argument. But you'll notice the above expression is quite a difficult-to-read collection of punctuation, and also that the resulting array has an extra element on it that we don't want - preg_match() places the whole matched string into the first index of the output array. We're close, but it's not ideal.
However, as @AlienHoboken correctly points out and almost correctly implements, a simpler solution would be to split the string up based on the position of the brackets. First let's take a look at the expression we'd need (or at least, the one that I would use):
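A minimal sketch of that preg_match() approach, with array_shift() used to drop the whole-match element. The function name splitKeys is hypothetical, and the pattern only handles exactly two bracketed parts after a lowercase name.

```php
<?php
// Hypothetical helper: extracts the three parts of a string like name[a][b].
function splitKeys(string $str): array
{
    $expr = '/([a-z]+)\[([^\]]+)\]\[([^\]]+)\]/';
    if (!preg_match($expr, $str, $m)) {
        return [];
    }
    // $m[0] is the whole matched string; remove it so only the captures remain.
    array_shift($m);
    return $m;
}

print_r(splitKeys('authors[0][system:id]'));
```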
(?:\[|\])+
This looks for at least one occurrence of either [ or ] and uses that block as the delimiter for the split. This seems like exactly what we need, except when we run it we'll find we have a small issue:
array('authors', '0', 'system:id', '')
Where did that extra empty string come from? Well, the last character of the input string matches your delimiter expression, so it's treated as a split position - with the result that an empty string gets appended to the results.
This is quite a common issue when splitting based on a regular expression, and luckily PCRE knows this and provides a simple way to avoid it: the PREG_SPLIT_NO_EMPTY flag.
So when we do this:
$str = 'authors[0][system:id]';
$expr = '/(?:\[|\])+/';
$result = preg_split($expr, $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
...you will see the result you want.

PHP Extract Similar Parts from Multiple Strings

I'm trying to extract the parts which are similar from multiple strings.
The purpose of this is an attempt to extract the title of a book from multiple OCRings of the title page.
This applies to only the beginning of the string, the ends of the strings don't need to be trimmed and can stay as they are.
For example, my strings might be:
$title[0]='the history of the internet, expanded and revised';
$title[1]='the history of the internet';
$title[2]='published by xyz publisher the historv of the internot, expanded and';
$title[3]='history of the internet';
So basically I would want to trim each string so that it starts at the most probable starting point. Considering that there may be OCR errors (e.g. "historv", "internot"), I thought it might be best to take the number of characters from each word, which would give me an array for each string (so a multi-dimensional array) with the length of each word. This can then be used to find running matches and trim the beginning of each string to the most likely starting point.
The strings should be cut to:
$title[0]='the history of the internet, expanded and revised';
$title[1]='the history of the internet';
$title[2]='the historv of the internot, expanded and';
$title[3]='XXX history of the internet';
So I need to be able to recognize that "history of the internet" (7 2 3 8) is the run which matches all strings, and that the preceding "the" is most probably correct seeing as it occurs in >50% of the strings, and therefore the beginning of each string is trimmed to "the" and a placeholder of the same length is added onto the string missing "the".
So far I have got:
function CompareSimilarStrings($array)
{
$n=count($array);
// Get length of each word in each string >
for($run=0; $run<$n; $run++)
{
$temp=explode(' ',$array[$run]);
foreach($temp as $key => $val)
$len[$run][$key]=strlen($val);
}
for($run=0; $run<$n; $run++)
{
}
}
As you can see, I'm stuck on finding the running matches.
Any ideas?

You should look into the Smith-Waterman algorithm for local string alignment. It is a dynamic programming algorithm which finds the parts of two strings that are similar, in the sense that they have a low edit distance.
So if you want to try it out, here is a php implementation of the algorithm.
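For a rough idea of what such an implementation does, here is a compact character-level sketch. The scoring values (+2 match, -1 mismatch, -1 gap) and the function name smithWaterman are assumptions chosen for illustration, and it compares only two strings at a time.

```php
<?php
// Sketch of Smith-Waterman local alignment with assumed scoring parameters.
// Returns the best score and the substrings of $a and $b that align best.
function smithWaterman(string $a, string $b, int $match = 2, int $mismatch = -1, int $gap = -1): array
{
    $n = strlen($a);
    $m = strlen($b);
    // $h[$i][$j]: best score of a local alignment ending at $a[$i-1] and $b[$j-1].
    $h = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    $best = 0;
    $bestI = 0;
    $bestJ = 0;
    for ($i = 1; $i <= $n; $i++) {
        for ($j = 1; $j <= $m; $j++) {
            $diag = $h[$i - 1][$j - 1] + ($a[$i - 1] === $b[$j - 1] ? $match : $mismatch);
            // Scores never drop below zero: a bad prefix is simply abandoned.
            $h[$i][$j] = max(0, $diag, $h[$i - 1][$j] + $gap, $h[$i][$j - 1] + $gap);
            if ($h[$i][$j] > $best) {
                $best = $h[$i][$j];
                $bestI = $i;
                $bestJ = $j;
            }
        }
    }
    // Trace back from the best cell until the score reaches zero.
    $i = $bestI;
    $j = $bestJ;
    while ($i > 0 && $j > 0 && $h[$i][$j] > 0) {
        $diag = $h[$i - 1][$j - 1] + ($a[$i - 1] === $b[$j - 1] ? $match : $mismatch);
        if ($h[$i][$j] === $diag) {
            $i--;
            $j--;
        } elseif ($h[$i][$j] === $h[$i - 1][$j] + $gap) {
            $i--;
        } else {
            $j--;
        }
    }
    return [
        'score' => $best,
        'a' => substr($a, $i, $bestI - $i),
        'b' => substr($b, $j, $bestJ - $j),
    ];
}

$result = smithWaterman(
    'published by xyz publisher the historv of the internot',
    'the history of the internet'
);
print_r($result);
```

Here the OCR errors only cost a mismatch each, so the aligned region in the first string begins at "the historv", which gives a candidate trim point for that title.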
