Seeking regular expression to find unquoted strings used as PHP array indices


To skip the explanation & get to the actual question, jump to below the horizontal line.
PHP variables must start with a $ dollar sign; e.g. $my_array
PHP arrays are indexed by scalars (like a C++ STL map or a Python dictionary, and similar to JSON); e.g. $my_array[0] or $my_array[$my_val], or, with strings, $my_array['abc'] (single quotes) or $my_array["abc"] (double quotes).
I have some inherited code which does not quote the string index : $my_array[abc]. That was allowed in previous versions of PHP, but is now causing me problems.
So, I am seeking a regex to find an open square bracket [ followed by a letter (a..z, A..Z); alternatively, followed by anything that is not a digit or dollar sign, whichever is easier.
Astute PHP coders will have seen that I can define('abc', 'abc') and use $my_array[abc], but that's an outlier & I will handle it manually.

You should be able to use something like this:
/(?<varname>\$[a-z_]\w*) \s*\[\s* (?<index>[a-z_]\w*) \s*\]/ix
This matches a variable name followed by a square bracket followed by a valid index, ignoring spaces along the way.
You can easily omit the beginning of the pattern if you don't want the variable name to be part of it (in case you're doing this as well with functions' return values or anything else).
Demo: https://regex101.com/r/8hbxs9/6
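As a rough sketch of how the pattern above could be applied, the following runs it over a small made-up snippet with preg_match_all and collects the offending accesses. The sample source and the $found variable are assumptions for illustration only; in practice you would feed in the contents of each legacy file.

```php
<?php
// Pattern from the answer above; /x ignores the literal spaces, /i is case-insensitive.
$pattern = '/(?<varname>\$[a-z_]\w*) \s*\[\s* (?<index>[a-z_]\w*) \s*\]/ix';

// Hypothetical legacy snippet to scan (nowdoc, so nothing is interpolated).
$source = <<<'PHP'
$a = $my_array[abc];
$b = $my_array['abc'];
$c = $my_array[0];
$d = $my_array[$key];
$e = $other[ some_key ];
PHP;

// Collect every "variable[bareword]" occurrence.
preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);
$found = [];
foreach ($matches as $m) {
    $found[] = $m['varname'] . '[' . $m['index'] . ']';
}
print_r($found); // lists $my_array[abc] and $other[some_key]
```

Note that the pattern will also flag "$my_array[abc]" inside double-quoted strings, where an unquoted key is still valid PHP, so those hits need a manual look.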

Related

Is there a way to delimit "ucwords()" in PHP such that the first char is not automatically uppercased?

PHP has the function ucwords(), which allows for custom delimiters. This works well, and will turn my test string into My Test String no problem.
Take the following example: I want to make a super awesome 2009 gamer tag.
$gamerTag = 'xxx_l33t_xxx'; // Not yet epic.
echo ucwords($gamerTag,"x"); // want it to return 'xXx_l33t_xXx'
I would have assumed strings would delimit case-sensitively and update the second x in each case, ignoring the third, since at that point the middle one would no longer match our delimiter.
However, this actually returns XxX_l33t_xXx, since it will automatically uppercase the first letter in the string.
I know that there are other methods of doing this (str_split() array loops and preg_replace() with a reverse lookup come to mind), but my primary question becomes the following:
Is there a way to delimit ucwords() such that it does not automatically uppercase the first character of the string?

The internal behaviour is unfortunately that the first character of the string will always be converted to upper case, regardless of the delimiters you pass in.
Digging into the PHP source, this is the implementation of ucwords:
*r = toupper((unsigned char) *r);
for (r_end = r + Z_STRLEN_P(return_value) - 1; r < r_end; ) {
    if (mask[(unsigned char)*r++]) {
        *r = toupper((unsigned char) *r);
    }
}
From https://github.com/php/php-src/blob/master/ext/standard/string.c#L2651
Here r is the return value, and mask is a char array of the delimiting characters. The first call to toupper (outside of the loop) means that there's no way to prevent the first character being converted.
Because this is done, it means the second character is not converted, since it's now preceded by X, not x. The third character is handled "correctly".
This can actually cause some strange cascading behaviour, since the return value is being iterated over while it's being modified:
php > echo ucwords('aaa', 'A');
AAA
The initial string doesn't contain the delimiting character anywhere, but the result is completely upper-case.
As mentioned in a comment, there's an open PHP bug to reflect this behaviour in the documentation here: https://bugs.php.net/bug.php?id=78393
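Since the built-in cannot be told to skip the first character, one possible workaround is a small userland loop that mirrors the C code above minus the initial toupper call. This is only a sketch: ucwords_keep_first is a made-up name, not a PHP function, and it works on single bytes just like the C loop does.

```php
<?php
// Hypothetical helper (not a PHP built-in): behaves like ucwords() with custom
// delimiters, but leaves the first character of the string untouched.
function ucwords_keep_first(string $str, string $delimiters): string
{
    $length = strlen($str);
    for ($i = 1; $i < $length; $i++) {
        // Looks at the already-modified previous byte, like the C loop does.
        if (strpos($delimiters, $str[$i - 1]) !== false) {
            $str[$i] = strtoupper($str[$i]);
        }
    }
    return $str;
}

echo ucwords_keep_first('xxx_l33t_xxx', 'x'), "\n"; // xXx_l33t_xXx
echo ucwords_keep_first('aaa', 'A'), "\n";          // aaa
```

Because the first byte is never upper-cased, the cascading effect shown with ucwords('aaa', 'A') does not occur here.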

tcl query for string pattern matching

I have a Tcl variable with the following contents:
m_hscaclbmbmmer_v11_2
m_letmmcbterbox_v1_2
m_osbbbbcmd_v16_0 v_proc_ss_v1_0
m_rmgbbb2cycrcb_v17_4
m_nscalbbbcer_v8_2
m_smpte2mbbc022_m12_rx_v2_3
m_smpte2m02cm2_12_tx_v2_2
m_smpte20m2mcm2_56_rx_v5_4
m_smpte202m2_c56_tx_v4_0
m_smpbbbte_sdbbci_v3_0
m_smpte_uhdmsdic_v1_2
m_tmmmc_v6_4
m_tpmmmg_v17_1
m_vcresamcpler_v1_2
m_vid_mminc_axi4s_v14_1
m_voip_femcc_rx_v11_3
m_voip_fmbmecc_tx_v1_2
m_vscalebcr_v1_4
m_ycrcb2bncrgb_v7_4
mid_phy_cnmcbmbontroller_v2_2
mibmmbbo_v3_4
miterbcbmbi_v9_4
madc_wicbmbz_v3_4
mambmbuic_v12_2 mvxfftc_v9_4
The last name, mvxfftc_v9_4, needs to be changed into microsemi.com:ip:mvxfft:9.4, the same needs to be done to all names. How do I do that?

The problem is slightly under-specified, but I'm assuming that you are taking each name like this (marking the parts to be extracted):
mvxfftc_v9_4
^^^^^^   ^ ^
and transforming that to (marking the parts inserted):
microsemi.com:ip:mvxfft:9.4
                 ^^^^^^ ^ ^
That's not too hard with regsub, and since we're changing a well-formed word to a well-formed word, we can probably just use it directly on the variable as string processing without needing to map it over the list explicitly:
set changed [regsub -all {\y(\w+)\w_v(\d+)_(\d+)\y} $original {microsemi.com:ip:\1:\2.\3}]
# \y is a beginning-or-end-of-word anchor
# \w means any alphanumeric-or-underscore
# \d means any digit
# \1, \2 and \3 are substituted by the parenthesised matches
There's a limit to the complexity of mappings that you can do this way, but maybe this will be enough.

Maybe you could try (assuming your list of names is in names):
foreach name $names {
    puts microsemi.com:ip:[regsub {_v(\d+)_(\d+)$} $name {:\1.\2}]
}
It's not pretty, but it does what you seem to be saying that you want done.
If you want a list instead of a printout, you could use this (Tcl 8.6 or later, leaves result in result):
set result [lmap name $names {
    lindex microsemi.com:ip:[regsub {_v(\d+)_(\d+)$} $name {:\1.\2}]
}]
or this (most versions):
set result {}
foreach name $names {
    lappend result microsemi.com:ip:[regsub {_v(\d+)_(\d+)$} $name {:\1.\2}]
}
Documentation: foreach, lappend, lindex, lmap, lmap replacement, puts, regsub, set
On regular expression writing:
http://tcl-lang.org/man/tcl8.5/tutorial/Tcl20.html (and the two following)
http://wiki.tcl.tk/_//search?submit=Search&S=regular+expression&charset=UTF-8
There are also a few sites dedicated to regular expressions for various programming languages: a Google search should help you.

Delete multiple file for/while

I have a PHP pull-down from which I select an item and delete
all files associated with it.
It works well if there are only 5 or 6. After I put in the
first 4 to test and get it working, I realized it could
take a very long time to enter a couple hundred and
would bloat the script.
Not knowing enough about for and while loops, is there
anyone who might have a way to help?
There will never be more than one set deleted at a time.
Thanks in advance.
<?php
$workitem = $_POST["workitem"];
$workdirPath = "/var/work.files/";
if ($workitem == 'item1.php')
{
    unlink("$workdirPath/page1.php");
    unlink("$workdirPath/temp1.php");
    unlink("$workdirPath/all1.php");
}
if ($workitem == 'item2.php')
{
    unlink("$workdirPath/page2.php");
    unlink("$workdirPath/temp2.php");
    unlink("$workdirPath/all2.php");
}
if ($workitem == 'item3.php')
{
    unlink("$workdirPath/page3.php");
    unlink("$workdirPath/temp3.php");
    unlink("$workdirPath/all3.php");
}
if ($workitem == 'item4.php')
{
    unlink("$workdirPath/page4.php");
    unlink("$workdirPath/temp4.php");
    unlink("$workdirPath/all4.php");
}
?>

Some simple pattern matching and substitution is all you need here.
First, the code:
if (preg_match('/^item(\d+)\.php$/', $workitem, $matches)) {
    $number = $matches[1];
    foreach (array('page', 'temp', 'all') as $base) {
        unlink("$workdirPath/$base$number.php");
    }
} else {
    # unrecognized work item value; complain to user or whatever
}
The preg_match function takes a pattern, a string, and an array. If the string matches the pattern, the parts that match are stored in the array. The particular type of pattern is a *p*erl5-compatible *reg*ular expression, which is where the preg_ part of the name comes from.
Regular expressions are scary-looking to the uninitiated, but they're a handy way to scan a string and get some values out of it. Most characters just represent themselves; the string "foo" matches the regular expression /foo/. But some characters have special meanings that let you make more general patterns to match a whole set of strings where you don't have to know ahead of time exactly what's in them.
The /s just mark the beginning and end of the actual regular expression; they're there because you can stick additional modifier flags inside the string along with the expression itself.
The ^ and $ represent the beginning and end of the string. /foo/ matches "foo", but also "foobar", "bunnyfoofoo", and so on - any string that contains "foo" will match. But /^foo$/ matches only "foo" exactly.
\d means "any digit". + means "one or more of that last thing". So \d+ means "one or more digits".
The period (.) is special; it matches any character at all. Since we want a literal period, we have to escape it with a backslash; \. just matches a period.
So our regular expression is '/^item\d+\.php$/', which will match any itemnumber.php filename. But that's not quite enough. The preg_match function is basically a binary test: does the string match the pattern or not, yes or no? In this case, it's not enough to just say "yup, the string is valid"; we need to know which items specifically the user specified. That's what capture groups are for. We use parentheses to say "remember what matched this part", and provide an array name that gets filled with those remembrances.
The part of the string that matches the whole regular expression (which may not be the whole string, if the regular expression isn't anchored with ^...$ like this one is) is always put in element 0 of the array. If you use parentheses in the regular expression, then the part of the string that matches the part of the regular expression inside the first pair of parentheses is stored in element 1 of the array; if there's a second set of parentheses, the matching part of the string goes in element 2 of the array, and so on.
So we put parentheses around our number ((\d+)) and then the actual number will be remembered in element 1 of our $matches array.
Great, we have a number. Now we just need to use it to build up the filenames we want to delete.
In each case, we want to delete three files: page$n.php, temp$n.php, and all$n.php, where $n is the number we extracted above. We could just put three unlink calls, but since they're all so similar, we can use a loop instead.
Take the different prefixes that are the same no matter the number, and make an array out of them. Then loop over that array. In the body of the loop, the variable $base will contain whichever element of the array it's currently on. Stick that between the $workdirPath prefix and the $number we got from the match, append .php, and that's your file. unlink it and go back to the top of the loop to grab the next one.
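To see the capture group and the loop at work without deleting anything, here is a dry-run sketch that only builds the list of filenames. The function name filesForWorkItem and the sample inputs are assumptions made for this illustration; swapping the returned list into unlink calls would give the behaviour described above.

```php
<?php
// Hypothetical helper: returns the files that would be deleted for a work
// item, or an empty array if the work item name is not recognized.
function filesForWorkItem(string $workitem, string $workdirPath): array
{
    // $matches[0] is the whole match, $matches[1] the digits in parentheses.
    if (!preg_match('/^item(\d+)\.php$/', $workitem, $matches)) {
        return [];
    }
    $number = $matches[1];
    $dir = rtrim($workdirPath, '/'); // avoid a doubled slash in the path
    $files = [];
    foreach (['page', 'temp', 'all'] as $base) {
        $files[] = $dir . '/' . $base . $number . '.php';
    }
    return $files;
}

print_r(filesForWorkItem('item42.php', '/var/work.files/'));
print_r(filesForWorkItem('../etc/passwd', '/var/work.files/')); // empty array
```

Because the anchored pattern only lets digits through, a tampered POST value such as the second one never reaches the filesystem calls.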

http_build_query function's excessive urlencoding

Why, when building a query string with the http_build_query function, does it urlencode square brackets [] outside values, and how do I get rid of it?
$query = array("var" => array("foo" => "value", "bar" => "encodedBracket["));
$queryString = http_build_query($query, "", "&");
var_dump($queryString);
var_dump("urldecoded: " . urldecode($queryString));
outputs:
var%5Bfoo%5D=value&var%5Bbar%5D=encodedBracket%5B
urldecoded: var[foo]=value&var[bar]=encodedBracket[
The function correctly urlencoded a [ in encodedBracket[ in the first line of the output but what was the reason to encode square brackets in var[foo]= and var[bar]=? As you can see, urldecoding the string also decoded reserved characters in values, encodedBracket%5B should have stayed as was for the query string to be correct and not become encodedBracket[.
According to section 2.2 Reserved Characters of Uniform Resource Identifier (URI): Generic Syntax
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm. If
data for a URI component would conflict with a reserved character's
purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," /
";" / "="
So shouldn't http_build_query really produce more readable output with characters like [] urlencoded only where it's required? How do I make it produce such output?

Here's a quick function I wrote to produce nicer query strings. It not only doesn't encode square brackets but will also omit the array key if it matches the index.
Note it doesn't support objects or the additional options of http_build_query. The $prefix argument is used for recursion and should be omitted for the initial call.
function http_clean_query(array $query_data, ?string $prefix = null): string {
    $parts = [];
    $i = 0;
    foreach ($query_data as $key => $value) {
        if ($prefix === null) {
            $key = rawurlencode($key);
        } else if ($key === $i) {
            // sequential integer key: emit "prefix[]" instead of "prefix[0]"
            $key = $prefix.'[]';
            $i++;
        } else {
            $key = $prefix.'['.rawurlencode($key).']';
        }
        if (is_array($value)) {
            if (!empty($value)) $parts[] = http_clean_query($value, $key);
        } else {
            $parts[] = $key.'='.rawurlencode($value);
        }
    }
    return implode('&', $parts);
}

I know this is a bit old, but I think it's still relevant today.
TL;DR: http_build_query() is working correctly
Longer explanation: Yes, http_build_query encodes [] and it looks awful... but it's the correct behavior: [] are reserved characters as per rfc3986#section-2.2. And... no, they are NOT reserved for passing arrays!
What [] are reserved for is defined in rfc3986#section-3.2.2:
A host identified by an Internet Protocol literal address, version
6 [RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax. In
anticipation of future, as-yet-undefined IP literal address formats,
an implementation may use an optional version flag to indicate such a
format explicitly rather than rely on heuristic determination.
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
So basically it is reserved for something like https://[2607:f8b0:4004:808::200e]
There is another question about this same topic here: https://stackoverflow.com/a/1016737/1204976

I found the following "fix" here:
[...] the workable 'fix' I have been using was to postprocess http_build_query() output
with the following - a 'solution' which makes my skin crawl just a little:
function http_build_query_unborker($s) {
    return preg_replace_callback('#%5[bd](?=[^&]*=)#i', function($match) {
        return urldecode($match[0]);
    }, $s);
}
So now it would become:
$query = array("var" => array("foo" => "value", "bar" => "encodedBracket["));
$queryString = http_build_query_unborker(http_build_query($query, "", "&"));
var_dump($queryString);
var_dump("urldecoded: " . urldecode($queryString)); // var[foo]=value&var[bar]=encodedBracket%5B

You've got many questions here. I read the word "should" in your questions in the RFC sense and answer in those same terms, taking your questions from bottom to top:
How do I make it produce such output?
By using a different encoder, Net_URL2 (pear / packagist) for example:
$vars = array("var" => array("foo" => "value", "bar" => "encodedBracket["));
$url = new Net_URL2('');
$url->setQueryVariables($vars);
$query = $url->getQuery();
var_dump($query); // string(41) "var[foo]=value&var[bar]=encodedBracket%5B"
So shouldn't http_build_query really produce more readable output with characters like [] urlencoded only where it's required?
No, it should not. Even though it is not required to encode the square brackets inside the query part, it is recommended, and what is recommended should be done.
Next to that, the http_build_query() function is not about creating "more readable output". It is only about creating the query of an HTTP URI. For such a query part, square brackets should be percent-encoded. These are reserved characters not specifically allowed for query.
What was the reason to encode square brackets in var[foo]= and var[bar]=?
The reason to encode square brackets there is the same reason to encode square brackets in encodedBracket[. The differentiation you do between these parts in your question is purely syntactic on your own, within an URI these parts are treated equal. There are no sub-parts of a query part in an URI. So making a distinction between the bracket var[ or the bracket encodedBracket[ is purely unrelated to URI encoding of the query part.
As you say that the percent-encoding of encodedBracket[ to encodedBracket%5B is correct and as it belongs into the same part of the URI (the query part), logic dictates that you must accept that encoding the bracket in var[ to var%5B is equally correct in terms of URI encoding. Same URI part, same encoding. The only ending delimiter the query part has is "#".
Additionally your reasoning shows a misunderstanding in this part:
As you can see, urldecoding the string also decoded reserved characters in values, encodedBracket%5B should have stayed as was for the query string to be correct and not become encodedBracket[.
If you urldecode, all percent-encoded sequences will be decoded - regardless whether the percent-encoding was representing a reserved character or not. In terms of correct, it's the opposite of what you stated: %5B has to be decoded to [ regardless if it was at the beginning, in the middle or at the end of the string.
Why, when building a query string with the http_build_query function, does it urlencode square brackets [] outside values, and how do I get rid of it?
It's easier to answer the second part, see at the beginning of the answer, it's already answered.
About the why this perhaps might not be immediately visible especially as you might have found out that PHP itself accepts percent-encoded and verbatim square brackets in the query (even intermixed) without any problems.
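The claim that PHP accepts both forms can be checked with a short sketch using parse_str, which applies the same parsing PHP uses for incoming query strings. The variable names and the hand-written $verbatim and $mixed strings are made up for this illustration.

```php
<?php
$query = ['var' => ['foo' => 'value', 'bar' => 'encodedBracket[']];

// What http_build_query() emits: brackets in keys and values are percent-encoded.
$encoded = http_build_query($query, '', '&');
echo $encoded, "\n";

// The same data with verbatim brackets in the keys, and an intermixed variant.
$verbatim = 'var[foo]=value&var[bar]=encodedBracket%5B';
$mixed    = 'var%5Bfoo%5D=value&var[bar]=encodedBracket%5B';

foreach ([$encoded, $verbatim, $mixed] as $queryString) {
    parse_str($queryString, $parsed);
    var_dump($parsed === $query); // bool(true) for all three
}
```

All three strings decode to the same array on the PHP side, so the difference only matters to whatever else reads the URI.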
How come the differences and why is that so? Is it really as simple as you outline it in your question? Is this a cosmetic difference only?
First of all, not encoding square brackets in the query part of a URI violates RFC3986 in the sense that the query part should not contain the brackets from gen-delims characters unencoded. Non-percent-encoded square brackets cannot be part of query according to the ABNF:
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Getting rid of these therefore is not suggested (at least for encoding purposes following the standard) as it will change the URI:
URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent.
This is already a good hint that for the URI you ask for, it has a different meaning than the URI PHP creates via the built-in function.
And further on:
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component.
This is not the case for all characters in gen-delims but per the ABNF:
"/" / "?" / ":" / "@"
So it therefore looks like that http_build_query() went the route to percent-encode square brackets as those are reserved characters and not specifically allowed by the URI scheme for that part (the query). Basically nothing wrong with it, it follows the recommendation of RFC3986. And it is not suggesting a different meaning for those parts of the query.
However you clearly say, that technically these brackets aren't delimiters in the query. And yes, that is true:
The query component is indicated by the first question
mark ("?") character and terminated by a number sign ("#") character
or by the end of the URI.
So comparing to what has been identified earlier as reserved characters not specifically allowed:
"#" / "[" / "]"
(already a pretty small list) it should be clear that "#" must stay reserved otherwise the URI gets broken (a true, separating delimiter at the end of query), but the square brackets must not be specifically allowed when representing an unequal URI without data-loss and preserving all URI delimiters:
If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So if you can still follow me, one might want actually do what you're asking for: Creating an URI in which the square brackets meaning as a delimiter (e.g. representing a fraction of an array definition) but not having this as data. Albeit the data of the character is preserved per RFC 3986.
It therefore is technically possible to create an URI with the square brackets not percent encoded within the query. Technically even inside values, like it would be a syntactical difference outside of values, this is only another syntactic difference for inside of values.
This is also the reason why browsers preserve the state of square brackets within the query when you enter these into your browser. Percent-encoded or not - the browser passes that part of the URI as-is to the server so that the underlying processes on the server can benefit from syntactic differences that might have been expressed by that.
So choose the URL encoding correctly for the underlying platform. Just because something is possible does not mean it works in a stable manner. The way http_build_query() does it is the most stable (safe) way following RFC 3986. However, it's a should in the RFC, so if you understand this to the point, you can have valid reasons to not percent-encode the square brackets.
One reason you name in your question is readability. This is especially important when you write down URLs for example on a sheet of paper. I'm not so sure if a square bracket is such a good distinguishable character and if not percent encoding does even help with readability. But I have not tried it. PHP would accept both ways. But then you won't need to do that programmatically. So perhaps readability wasn't really the case in your scenario.

Explode and Implode in APL

How could functions similar to PHP's explode and implode be implemented with APL?
I tried to work it out myself and came up with a solution which I'm posting below. I'd like to see other ways that this might be solved.

Pé, the quest for "short" and/or "elegant" solutions to standard-problems in APL is older than PHP and even older than new terminology, such as "explode", "implode" (I think - but I must admit I do not know how old these terms really are...). Anyway, the early APL guys used the term "idiom" for such "solutions to standard problems that fit in one line of APL".
And for some reason, the Finns were especially creative and even started producing a list of these in order to make it easy for newbies. And I find this stuff still useful after 20yrs of doing APL. It is called "FinnAPL" - the Finnish APL idiom library and you can browse it here: https://aplwiki.com/wiki/FinnAPL_idiom_library (BTW, the whole APL Wiki might be interesting to read...)
You may, however, need to be creative with your wording in order to find solutions ;)
And one warning: FinnAPL only works with "classic" (non-nested) data structures (nested matrices came with "APL2", which is standard these days), so some of the ways they handle data might no longer be "state-of-the-art" (i.e. back in the "old times", CAT BIRD and DOG would have been represented as a 3x4 array, so "implode" of a string array was as simple as ,array,delimiter, but you then had the challenge of removing the blanks which were inserted for padding).
Anyway, I'm not sure why I wrote all this - just a few thoughts which came to mind when thinking about my start with APL ;-)
Ok, let me also look at the question. When your delimiter is a single character, the APL2ish-idiomatic way of handling this would be something like this:
⎕ml←3 ⍝ "migration-level" (only Dyalog APL) to ensure APL2-compatibility
s←' '
A←s,'BIRD',s,'CAT',s,'DOG' ⍝ note that delimiter also used as 1st char!
exploded_string←1↓¨(+\A=s)⊂A ⍝ explode
imploded←∊s,¨exploded_string
A≡imploded ⍝ test for successful round-trip should return 1

Explode:
Given the following text string and delimiter string:
F←'CAT BIRD DOG'
B←' '
Explode can be accomplished as follows:
S←⍴,B
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂B⍷F)⊂F
P[2] ⍝ returns BIRD
Limitations:
PHP's explode function returns an empty string element when two delimiters are adjacent to each other. The code above simply ignores that and treats the two delimiters as if they were one.
The code above also does nothing to handle overlapping delimiters. This is most likely to occur if repeated characters are used for the delimiter. For example:
F←'CATaaaBIRDaaDOG'
B←'aa'
S←⍴,B
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂B⍷F)⊂F
P ⍝ returns CAT BIRD DOG
However, the expected result would be CAT aBIRD DOG because it doesn't recognize 'aaa' as the delimiter followed by 'a.' Rather, it treats it as two overlapping delimiters, which end up functioning as a single delimiter. Another example would be 'tat' as the delimiter, in which case, any occurrence in the string of 'tatat' would have the same problem.
Overlapping Delimiters:
I have an alternative for the possibility of a single overlap:
S←⍴,B
A←B⍷F
A←(2×A)>⊃+/(-S-⍳S)⌽¨S⍴⊂A
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂A)⊂F
The third line of code eliminates any string positions that occur within a distance of S-1 characters from any delimiter position before it. As I said, this only solves the problem for a single overlap. If there are two or more overlaps, the first is recognized as a delimiter, and all the rest are ignored. Here's an example of two overlaps:
F←'CATtatatatBIRDtatDOG'
B←'tat'
S←⍴,B
A←B⍷F
A←(2×A)>⊃+/(-S-⍳S)⌽¨S⍴⊂A
P←(⊃~∨/(-S-⍳S)⌽¨S⍴⊂A)⊂F
P ⍝ returns CAT atatBIRD DOG
The expected result was 'CAT a BIRD DOG,' but it is unable to recognize the final 'tat' as a delimiter because of the overlap. Such a situation would be rare except when repeated characters are used. If the delimiter is 'aa', then 'aaaa' would be considered a double overlap, and only the first delimiter would be recognized.
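For reference, the PHP behaviour that the APL versions are being compared against can be reproduced with a few calls. This is only a sketch of the target semantics using the same sample strings as above, plus one made-up string with a doubled space.

```php
<?php
// Adjacent delimiters yield an empty string element between them.
print_r(explode(' ', 'CAT  DOG'));                 // CAT, (empty string), DOG

// Delimiters are consumed left to right, so an overlap leaves the remainder as data.
print_r(explode('aa', 'CATaaaBIRDaaDOG'));         // CAT, aBIRD, DOG
print_r(explode('tat', 'CATtatatatBIRDtatDOG'));   // CAT, a, BIRD, DOG

// implode is the inverse for a fixed delimiter.
echo implode('-', ['CAT', 'BIRD', 'DOG']), "\n";   // CAT-BIRD-DOG
```

An APL version that matches these results would need to scan left to right and skip past each delimiter once it has been recognized, rather than marking every position where the delimiter is found.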
Implode:
Much simpler:
P←'CAT' 'BIRD' 'DOG'
B←'-'
(⍴,B)↓∊B,¨P
It returns 'CAT-BIRD-DOG' as expected.

An interesting alternative for implode can be accomplished with reduction:
p←'cat' 'bird' 'dog'
↑{⍺,'-',⍵}/p
cat-bird-dog
This technique does not need to explicitly reference the shape of the delimiter.
And an interesting alternative to explode can be done with n-wise reduction:
f←'CATtatBIRDtatDOG'
b←'tat'
b{(~(-⍴⍵)↑(⍴⍺)∨/⍺⍷⍵)⊂⍵}f
CAT BIRD DOG
