I’m trying to parse some strings in some messed-up CSV files (about 100,000 rows per file). Some columns have been squished together in some rows, and I’m trying to get them unsquished back into their proper columns. Part of the logic needed there is to find whether a substring in a given column is numeric or not.
Non-numeric strings can be anything, including strings that happen to begin with a number; numeric strings are generally written the European way, with dots used for thousand separators and commas for decimals, so without going through a bunch of string replacements, is_numeric() won’t do the trick:
\var_dump(is_numeric('3.527,25')); // bool(FALSE)
I thought – naïvely, it transpires – that the right thing to do would be to use NumberFormatter::parse(), but it seems that function doesn’t actually check whether the string given as a whole is parseable as a numeric string at all – instead it just starts at the beginning and when it reaches a character not allowed in a numeric string, cuts off the rest.
Essentially, what I’m looking for is something that will yield this:
$formatter = new \NumberFormatter('de-DE', \NumberFormatter::DECIMAL);
\var_dump($formatter->parse('3.527,25')); // float(3527.25)
\var_dump($formatter->parse('3thisisnotanumber')); // bool(FALSE)
But all I can get is this:
$formatter = new \NumberFormatter('de-DE', \NumberFormatter::DECIMAL);
\var_dump($formatter->parse('3.527,25')); // float(3527.25)
\var_dump($formatter->parse('3thisisnotanumber')); // float(3)
I figured perhaps the problem was that the LENIENT_PARSE attribute was set to true, but setting it to false ($formatter->setAttribute(\NumberFormatter::LENIENT_PARSE, 0)) has no effect; non-numeric strings still get parsed just fine as long as they begin with a number.
Since there are so many rows and each row may have as many as ten columns that need to be validated, I’m looking at upwards of a million validations per file – for that reason, I would prefer avoiding a preg_match()-based solution, since a million regex match calls would be quite expensive.
Is there some way to tell the NumberFormatter class that you would like it to please not be lenient and only treat the string as parseable if the entire string is numeric?
You can strip all the separators and check if whatever remains is a numeric value.
function customIsNumeric(string $value): bool
{
return is_numeric(str_replace(['.', ','], '', $value));
}
Live test available here.
You can use is_numeric() to check that it is only numbers before parsing. But NumberFormatter does not do what you are looking for here.
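One possible way to get the strictness the question asks for, sketched under the assumption that the intl extension is loaded and the input is plain ASCII: NumberFormatter::parse() accepts a third, by-reference argument that reports the offset at which parsing stopped. Comparing that offset with the string length rejects anything with trailing junk. The helper name parseStrict is hypothetical.

```php
<?php
// Hypothetical helper: returns the parsed float, or false unless the
// WHOLE string was consumed by the parser. Requires ext-intl.
function parseStrict(\NumberFormatter $formatter, string $value)
{
    $offset = 0; // in: where to start parsing; out: where parsing stopped
    $result = $formatter->parse($value, \NumberFormatter::TYPE_DOUBLE, $offset);

    // Assumption: ASCII input, so the reported offset is comparable to strlen().
    if ($result === false || $offset !== strlen($value)) {
        return false;
    }
    return $result;
}

if (extension_loaded('intl')) {
    $formatter = new \NumberFormatter('de-DE', \NumberFormatter::DECIMAL);
    var_dump(parseStrict($formatter, '3.527,25'));
    var_dump(parseStrict($formatter, '3thisisnotanumber'));
}
```

The formatter is created once and reused, which matters when validating around a million values per file.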
I’m looking for the most concise URL rather than the shortest PHP code. I don’t want my users to be scared by the hideous URLs that PHP creates when encoding arrays.
PHP will do a lot of repetition in the query string if you just stuff an array ($fs) through http_build_query:
$fs = array(5, 12, 99);
$url = "http://$_SERVER[HTTP_HOST]/?" .
http_build_query(array('c' => 'asdf', 'fs' => $fs));
The resulting $url is (with the percent-encoded brackets shown decoded for readability)
http://example.com/?c=asdf&fs[0]=5&fs[1]=12&fs[2]=99
How do I get it down to a minimum (using PHP or methods easily implemented in PHP)?
Default PHP way
What http_build_query does is a common way to serialize arrays to URL. PHP automatically deserializes it in $_GET.
When wanting to serialize just a (non-associative) array of integers, you have other options.
Small arrays
For small arrays, conversion to underscore-separated list is quite convenient and efficient. It is done by $fs = implode('_', $fs). Then your URL would look like this:
http://example.com/?c=asdf&fs=5_12_99
The downside is that you’ll have to explicitly explode('_', $_GET['fs']) to get the values back as an array.
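A minimal round-trip sketch of the underscore approach. Here parse_str() is only a stand-in for what PHP does when it fills $_GET, and the intval mapping is an added assumption that the values should come back as integers rather than strings.

```php
<?php
$fs = [5, 12, 99];

// Encode: join the integers with underscores before building the query.
$query = http_build_query(['c' => 'asdf', 'fs' => implode('_', $fs)]);
echo $query, "\n"; // c=asdf&fs=5_12_99

// Decode: parse_str() stands in for PHP populating $_GET.
parse_str($query, $params);
$decoded = array_map('intval', explode('_', $params['fs']));
var_dump($decoded === $fs); // bool(true)
```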
Other delimiters may be used too. Underscore is considered alphanumeric and as such rarely has special meaning. In URLs, it is usually used as a space replacement (e.g. by MediaWiki). It is hard to distinguish when used in underlined text. Hyphen is another common replacement for space. It is also often used as a minus sign. Comma is a typical list separator, but unlike underscore and hyphen it is percent-encoded by http_build_query and has special meaning almost everywhere. The situation is similar with the vertical bar (“pipe”).
Large arrays
When you have large arrays in URLs, you should first stop coding and start thinking. This almost always indicates bad design. Wouldn’t the HTTP POST method be more appropriate? Don’t you have a more readable and space-efficient way of identifying the addressed resource?
URLs should ideally be easy to understand and (at least partially) remember. Placing a large blob inside is really a bad idea.
Now you have been warned. If you still need to embed a large array in a URL, go ahead: compress the data as much as you can, base64-encode it to turn the binary blob into text, and URL-encode the text to make it safe for embedding in a URL.
Modified base64
Or better, use a modified version of base64. The variant I chose uses - instead of +, uses _ instead of /, and omits the = padding.
define('URL_BASE64_FROM', '+/');
define('URL_BASE64_TO', '-_');
function url_base64_encode($data) {
    $encoded = base64_encode($data);
    if ($encoded === false) {
        return false;
    }
    // Swap in the URL-safe characters and drop the '=' padding.
    return str_replace('=', '', strtr($encoded, URL_BASE64_FROM, URL_BASE64_TO));
}
function url_base64_decode($data) {
    // Restore the '=' padding that url_base64_encode() stripped.
    // str_pad() expects the total target length, not the number of characters to add.
    $len = strlen($data);
    $padded = str_pad($data, $len + (4 - $len % 4) % 4, '=', STR_PAD_RIGHT);
    return base64_decode(strtr($padded, URL_BASE64_TO, URL_BASE64_FROM));
}
This saves two bytes for each character that would otherwise be percent-encoded. There is also no need to call urlencode.
Compression
You have to choose between gzip (gzcompress) and bzip2 (bzcompress). I did not want to invest time in a thorough comparison, but gzip looked better on several relatively small inputs (around 100 chars) for any block size setting.
Packing
But what data should be fed into the compression algorithm?
In C, one would cast array of integers to array of chars (bytes) and hand it over to the compression function. That’s the most obvious way to do things. In PHP the most obvious way to do things is converting all the integers to their decimal representation as strings, then concatenation using delimiters, and only after that compression. What a waste of space!
So, let’s use the C approach! We’ll get rid of the delimiters and otherwise wasted space and encode each integer in 2 bytes using pack:
define('PACK_NUMS_FORMAT', 'n*');
function pack_nums($num_arr) {
array_unshift($num_arr, PACK_NUMS_FORMAT);
return call_user_func_array('pack', $num_arr);
}
function unpack_nums($packed_arr) {
return unpack(PACK_NUMS_FORMAT, $packed_arr);
}
Note: the n format code is defined as an unsigned 16-bit integer in big-endian byte order, so the packed bytes are the same on every machine and existing links stay valid when you move between systems. This would not hold for a machine-dependent code such as S, where byte order could change between machines and break links when integrating multiple systems. The price of using 2 bytes per integer is that only values from 0 to 65535 fit.
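According to the PHP manual, the n format code is an unsigned 16-bit integer that is always big-endian. The short sketch below shows the bytes it produces, plus two details worth knowing before relying on it: unpack() returns an array keyed from 1, and values outside 0..65535 do not survive the round trip.

```php
<?php
// 'n' = unsigned 16-bit integer, big-endian regardless of platform.
$packed = pack('n*', 5, 12, 99);
echo bin2hex($packed), "\n"; // 0005000c0063

// unpack() returns an array keyed from 1, not 0.
$nums = unpack('n*', $packed);
var_dump(array_values($nums) === [5, 12, 99]); // bool(true)

// Only the low 16 bits are kept, so larger values wrap around.
$wrapped = unpack('n*', pack('n*', 70000));
echo $wrapped[1], "\n"; // 4464 (70000 mod 65536)
```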
Encoding together
Now packing, compression and modified base64, all in one:
function url_embed_array($arr) {
return url_base64_encode(gzcompress(pack_nums($arr)));
}
function url_parse_array($data) {
return unpack_nums(gzuncompress(url_base64_decode($data)));
}
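For a quick round-trip check, the whole pipeline can be restated compactly. This is a sketch rather than a replacement for the functions above: the names embed_nums and parse_nums are hypothetical, it assumes the zlib extension is available and every value fits in 16 bits, and it relies on base64_decode() tolerating missing padding in its default non-strict mode.

```php
<?php
// Hypothetical compact version of the pack -> compress -> URL-safe base64 pipeline.
function embed_nums(array $nums): string
{
    $packed = pack('n*', ...$nums);             // 2 bytes per integer
    $b64 = base64_encode(gzcompress($packed));  // compress, then encode as text
    return rtrim(strtr($b64, '+/', '-_'), '='); // URL-safe alphabet, no padding
}

function parse_nums(string $data): array
{
    $packed = gzuncompress(base64_decode(strtr($data, '-_', '+/')));
    return array_values(unpack('n*', $packed)); // re-index from 0
}

$original = range(1, 1000);
$token = embed_nums($original);
var_dump(parse_nums($token) === $original); // bool(true)
```

The array_values() call is there because unpack() keys its result from 1, which would otherwise make the decoded array differ from the input.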
See the result on IdeOne. It beats the OP’s answer: on his 40-element array my solution produces 91 characters against his 98. Using range(1, 1000) (which generates array(1, 2, 3, …, 1000)) as a benchmark, the OP’s solution produces 2712 characters and mine just 2032, which is about 25 % better.
For the sake of completeness, OP’s solution is
function url_embed_array($arr) {
return urlencode(base64_encode(gzcompress(implode(',', $arr))));
}
There are multiple approaches possible:
serialize + base64 - can swallow any object, but data overhead is horrible.
implode + base64 - limited to arrays, forces user to find unused char as delimiter, data overhead is much smaller.
implode - unsafe for unescaped strings. Requires strict data control.
$foo = array('some unsafe data', '&&&==http://', '65535');
$ser = base64_encode(serialize($foo));
$imp = implode('|', $foo); // separator first; the reversed argument order is rejected since PHP 8.0
$imp2 = base64_encode($imp);
echo "$ser\n$imp\n$imp2";
Results are as follows:
YTozOntpOjA7czoxNjoic29tZSB1bnNhZmUgZGF0YSI7aToxO3M6MTI6IiYmJj09aHR0cDovLyI7aToyO3M6NToiNjU1MzUiO30=
some unsafe data|&&&==http://|65535
c29tZSB1bnNhZmUgZGF0YXwmJiY9PWh0dHA6Ly98NjU1MzU=
While the serialize+base64 result is horribly long, implode+base64 gives output of manageable length that is safe for GET… except for that = at the end.
I believe the answer depends on the size of the query string.
Short query strings
For shorter query strings, this may be the best way:
$fs = array(5, 12, 99);
$fs_no_array = implode(',', $fs);
$url = "http://$_SERVER[HTTP_HOST]/?" .
http_build_query(array('c' => 'asdf', 's' => 'jkl')) . '&fs=' . $fs_no_array;
resulting in
http://example.com/?c=asdf&s=jkl&fs=5,12,99
On the other end you do this to get your array back:
$fs = array_map('intval', explode(',', $_GET['fs']));
Quick note about delimiters: One valid reason to avoid commas is that they are used as delimiters in so many other applications. On the off-chance you may want to parse your URLs in Excel, for example, the commas might make it slightly more difficult. Underscores would also work, but can blend in with the underlining that is standard in web formatting for links. So dashes may actually be a better choice than either commas or underscores.
Long query strings
I came across another possible solution:
$fs_compressed = urlencode(base64_encode(gzcompress($fs_no_array)));
On the other end it can be decompressed by
$fs_decompressed = gzuncompress(base64_decode($_GET['fs']));
$fs = array_map('intval', explode(',', $fs_decompressed));
assuming it’s passed in through GET variable.
Efficiency tests
31 elements
$fs = array(7,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,52,53,54,61);
Result:
eJwFwckBwCAQxLCG%2FMh4D6D%2FxiIdpGiG5fLIR0IkRZoMWXLIJQ8%2FDIqFjYOLBy8jU0yz%2BQGlbxAB
$fs_no_array is 84 characters long, $fs_compressed 84 characters long. The same!
40 elements
$fs = array(7,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,52,53,54,61);
Result:
eJwNzEkBwDAQAzFC84jtPRL%2BxFoB0GJC0QyXhw4SMgoq1GjQoosePljYOLhw48GLL37kEJE%2FDCnSZMjSpkMXow%2BdIBUs
$fs_no_array is 111 characters long, $fs_compressed 98 characters long.
Summary
The saving is only about 10 %, but at greater lengths it will increase to beyond 50 %.
If you use Yahoo sites, you notice things like comma separated lists as well as sometimes a series of random looking characters. They may be employing these solutions in the wild already.
Also check out this stack question, which talks in way too much detail about what is allowed in a URI.
I'm working through some more PHP tutorials, specifically DevZone PHP 101, and am confused by:
echo .sprintf("%4.2f", (2 * $radius * pi()))
I found this
I think that means produce a floating-point field four positions wide with two decimal places, using the value of the first succeeding parameter.
That comes from the C/C++ line of programming languages. sprintf() takes the first parameter as a format statement. Anything in it starting with a % is a field specifier; anything else is just printable text. So if you give a format statement with all text and no specifiers, it will print exactly the way it appears. With format specifiers, it needs data to work on.
But after trying some different values I'm still not getting it. It seems to me if the purpose of it in this case is just to limit the decimal to 2 places all I have to put is
.sprintf("%.2f", (2 * $radius * pi()))
What is the point of the 4 in front of it? The PHP Manual leads me to believe it sets the total number of characters to 4, but (a) that's not the case, since the decimal point makes it 5 characters, and (b) that's not the case because I tried changing it to a larger number like %8.2f and it didn't tack any zeros on to either end. Could someone please explain this better?
Thanks!
The first number in the format specifier (the 8 in %8.2f) is the padding width, i.e. the minimum total length of the output. By default sprintf pads with the space character.
You can see the effect with larger numbers:
printf("%20.2f", 1.23);
Will for example lead to:
                1.23
There are 16 spaces before the number: the float takes up 4 characters and the field width was set to 20. (Maybe you printed it into a web page, where the padding spaces are collapsed and so not visible.)
And there's an example further below on the sprintf manpage to use alternative padding characters:
printf("%'*20.2f", 1.23); // use the custom padding character '*'
Will result in:
****************1.23
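A small sketch tying this back to the question's %4.2f and %8.2f. The square brackets are only there to make the padding visible; the sample values are arbitrary.

```php
<?php
$value = 3.14159;

// Width 8, 2 decimals: padded on the left with spaces to 8 characters.
echo '[', sprintf('%8.2f', $value), "]\n";  // [    3.14]

// A leading 0 switches the padding character to zeros.
echo '[', sprintf('%08.2f', $value), "]\n"; // [00003.14]

// The width is a minimum: longer results are never truncated.
echo '[', sprintf('%4.2f', 1234.5), "]\n";  // [1234.50]
```

This is why %4.2f appeared to do nothing in the original experiment: any result with two decimals is already at least four characters wide, and space padding from a larger width is collapsed when the output is rendered as HTML.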
I am looking for the best way to convert a MongoDB id 504aaedeff558cb507000004 into a shorter representation in PHP? Basically, users can reference id's in the app, and that long string is difficult.
The one caveat is, collisions should be 'rare'. Can we somehow get it down to 4, 5 or 6 characters?
Thanks.
While a hex digit can store 16 different states, a base64 encoded digit can store 64 different states, so you can store your whole MongoDB Id in 16 digits instead of 24 without losing any information:
print hexToBase64("50b3701de3de2a2416000000") . "\n"; # -> ULNwHePeKiQWAAAA
print base64ToHex("ULNwHePeKiQWAAAA") . "\n"; # -> 50b3701de3de2a2416000000
function base64ToHex($string) {
return bin2hex(base64_decode($string));
}
function hexToBase64($string) {
return base64_encode(hex2bin($string));
}
Your unique ID to start with can be mapped by [0-9a-f]. Shortening can be done in multiple ways - one easy way is to re-map character sets.
Our aim will be to cut the string size in two by replacing characters. A single character is one of 16, so two characters gives you 16^2 = 256 possibilities... I'm sure you know where I'm going with this. Take each couple of characters in your string, and calculate the mapping value. Generate the ASCII character corresponding, and use this instead. If you dislike having such an ugly ID at the end, base64-encode it - you'll get a string which is roughly 1/3 shorter than the one you started with.
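The pair-by-pair mapping described above can be sketched by hand as follows. The loop is purely illustrative, since the built-in hex2bin() performs the same conversion, and the ID is the one from the question.

```php
<?php
$hex = '504aaedeff558cb507000004';

// Map each pair of hex digits to the single byte with that value (0-255).
$bytes = '';
foreach (str_split($hex, 2) as $pair) {
    $bytes .= chr(hexdec($pair));
}

var_dump($bytes === hex2bin($hex));      // bool(true)
var_dump(strlen($bytes));                // int(12)
var_dump(strlen(base64_encode($bytes))); // int(16)
```

So the 24 hex characters become 12 raw bytes, or 16 printable characters after base64, without losing any information.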
I have to replace xmlns with ns in my incoming XML in order to fix SimpleXMLElement's xpath() function. Most functions do not have a performance problem, but there always seems to be an overhead as the string grows.
E.g. preg_replace on a 2 MB string takes 50ms to process, even if I limit the replaces to 1 and the replace is done at the very beginning.
If I substr the first few characters and replace only in that part it is slightly faster, but that is not really what I want.
Is there any PHP method that would perform better in my problem? And if there is no option, could a simple php extension help, that just does Replace => SimpleXMLElement in C?
If you know exactly where the offending "x", "m" and "l" are, you can just use something like $xml[$x_pos] = ' '; $xml[$m_pos] = ' '; $xml[$l_pos] = ' ' to transform them into spaces. Or transform them into ns___ (where _ = space).
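A sketch of that same-length overwrite, assuming the first xmlns= in the document is the one to neutralise (the sample XML is made up). Because XML allows whitespace before the = of an attribute, the result should still be well-formed. Whether this is actually faster than a regular replace on a 2 MB string would need measuring.

```php
<?php
$xml = '<root xmlns="http://example.com/ns"><item/></root>';

$pos = strpos($xml, 'xmlns=');
if ($pos !== false) {
    // Overwrite 'xmlns' with 'ns   ' (same length), one byte at a time,
    // so the string never changes size.
    $replacement = 'ns   ';
    for ($i = 0; $i < 5; $i++) {
        $xml[$pos + $i] = $replacement[$i];
    }
}
echo $xml, "\n"; // <root ns   ="http://example.com/ns"><item/></root>
```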
You're always going to get an overhead when trying to do this - you're dealing with a char array and trying to replace multiple matching elements of the array (i.e. words).
50ms is not much of an overhead, unless (as I suspect) you're trying to do this in a loop?
50ms sounds pretty reasonable to me, for something like this. The requirement itself smells of something being wrong.
Is there any particular reason that you're using regular expressions? Why do people keep jumping to the overkill regex solution?
There is a bog-standard string replace function called str_replace that may do what you want in a fraction of the time (though whether this is right for you depends on how complex your search/replace is).
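str_replace() has no limit parameter, so if only the first occurrence should change, one regex-free option is strpos() plus substr_replace(). This is a swapped-in technique rather than what the answer above literally proposes, and replace_first is a hypothetical helper name.

```php
<?php
// Hypothetical helper: replace only the first occurrence of $search.
function replace_first(string $haystack, string $search, string $replace): string
{
    $pos = strpos($haystack, $search);
    if ($pos === false) {
        return $haystack; // nothing to replace
    }
    return substr_replace($haystack, $replace, $pos, strlen($search));
}

$xml = '<root xmlns="urn:example"><a xmlns="urn:other"/></root>';

echo replace_first($xml, 'xmlns=', 'ns='), "\n";
// <root ns="urn:example"><a xmlns="urn:other"/></root>

echo str_replace('xmlns=', 'ns=', $xml), "\n";
// <root ns="urn:example"><a ns="urn:other"/></root>
```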
From the PHP source, as we can see, for example here:
http://svn.php.net/repository/php/php-src/branches/PHP_5_2/ext/standard/string.c
I don't see any copies being made there, but I'm no expert in C. On the other hand, there are many convert-to-string calls, which at first sight could copy values. If they do copy values, then we are in trouble.
Only if we are in trouble
Try to reinvent the str_replace wheel with character-by-character processing. For example, take the string $somestring = "somevalue". In PHP we can access its characters by index: echo $somestring[0] gives "s" and echo $somestring[2] gives "m" (the older $somestring{0} curly-brace form was removed in PHP 8.0). I'm not sure about this approach, but it is an option if the official implementations don't use references, as they should.