How to get NumberFormatter::parse() to only parse actual numeric strings?


I’m trying to parse some strings in some messed-up CSV files (about 100,000 rows per file). Some columns have been squished together in some rows, and I’m trying to get them unsquished back into their proper columns. Part of the logic needed there is to find whether a substring in a given column is numeric or not.
Non-numeric strings can be anything, including strings that happen to begin with a number; numeric strings are generally written the European way, with dots used for thousand separators and commas for decimals, so without going through a bunch of string replacements, is_numeric() won’t do the trick:
\var_dump(is_numeric('3.527,25')); // bool(FALSE)
I thought – naïvely, it transpires – that the right thing to do would be to use NumberFormatter::parse(), but it seems that function doesn’t actually check whether the string given as a whole is parseable as a numeric string at all – instead it just starts at the beginning and when it reaches a character not allowed in a numeric string, cuts off the rest.
Essentially, what I’m looking for is something that will yield this:
$formatter = new \NumberFormatter('de-DE', \NumberFormatter::DECIMAL);
\var_dump($formatter->parse('3.527,25')); // float(3527.25)
\var_dump($formatter->parse('3thisisnotanumber')); // bool(FALSE)
But all I can get is this:
$formatter = new \NumberFormatter('de-DE', \NumberFormatter::DECIMAL);
\var_dump($formatter->parse('3.527,25')); // float(3527.25)
\var_dump($formatter->parse('3thisisnotanumber')); // float(3)
I figured perhaps the problem was that the LENIENT_PARSE attribute was set to true, but setting it to false ($formatter->setAttribute(\NumberFormatter::LENIENT_PARSE, 0)) has no effect; non-numeric strings still get parsed just fine as long as they begin with a number.
Since there are so many rows and each row may have as many as ten columns that need to be validated, I’m looking at upwards of a million validations per file – for that reason, I would prefer avoiding a preg_match()-based solution, since a million regex match calls would be quite expensive.
Is there some way to tell the NumberFormatter class that you would like it to please not be lenient and only treat the string as parseable if the entire string is numeric?

You can strip all the separators and check if whatever remains is a numeric value.
function customIsNumeric(string $value): bool
{
return is_numeric(str_replace(['.', ','], '', $value));
}
Live test available here.
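A quick check of how the function above behaves on the two strings from the question. It also shows a limitation worth knowing about: because the separators are removed without looking at where they sit, malformed input passes as well. This is only a minimal sketch, and the third sample input is made up for illustration.

```php
<?php
function customIsNumeric(string $value): bool
{
    return is_numeric(str_replace(['.', ','], '', $value));
}

var_dump(customIsNumeric('3.527,25'));          // bool(true)
var_dump(customIsNumeric('3thisisnotanumber')); // bool(false)
var_dump(customIsNumeric('1,2,3.4'));           // bool(true), although it is not a well-formed number
```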

You can use is_numeric() to check that it is only numbers before parsing. But NumberFormatter does not do what you are looking for here.
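One possible workaround, not taken from the answers above, relies on the third argument of NumberFormatter::parse(): it is passed by reference and, on return, holds the offset at which parsing ended. Comparing that offset with the string length tells you whether the whole string was consumed. The helper name parseWholeString is hypothetical, the intl extension is required, and comparing against strlen() assumes ASCII input, which should hold for numbers written the de-DE way.

```php
<?php
// Hypothetical helper: returns the parsed float, or null if any part
// of the string was left unparsed.
function parseWholeString(NumberFormatter $formatter, string $value): ?float
{
    $offset = 0;
    $number = $formatter->parse($value, NumberFormatter::TYPE_DOUBLE, $offset);

    // parse() updates $offset to the position where it stopped reading.
    if ($number === false || $offset !== strlen($value)) {
        return null;
    }
    return $number;
}

$formatter = new NumberFormatter('de-DE', NumberFormatter::DECIMAL);
var_dump(parseWholeString($formatter, '3.527,25'));          // float(3527.25)
var_dump(parseWholeString($formatter, '3thisisnotanumber')); // NULL
```

This avoids both regular expressions and string replacements, at the cost of one extra integer comparison per value.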

Related

In PHP how can i manage strings with exponential numbers

I have a problem with some alphanumeric strings containing the exponent character "E". They are stored in the database in a "character varying" column, so they are strings, but when I try to display them on the web page I get the string "INF" instead of the original.
For example, "55E77583" (which for me is just an order code) becomes "INF" on the web page.
I searched for a solution and found sprintf and printf, but after several tries with different % format combinations I still cannot get the original form of the string.
$code = "55E77583";
echo sprintf('%s', $code);
//Gives me "INF"
$code = "55E77583";
printf('%s', $code);
//Gives me always "INF"
I really need to get the original form of the string, always, for every possible alphanumeric combination. How can I do this?
Thank you.

If I understand correctly, you want to display the value 55E77583 as a string on your webpage. The code you provided does exactly that.
So your variable must somehow have been converted to a float (double) beforehand. You get INF because, read as a number, the value is too large for PHP to represent.
Make sure your variable is actually a string by echoing
echo gettype($code);
This will very likely produce "double". Maybe a type conversion is happening during your select.
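To illustrate how the value could turn into INF, here is a minimal sketch. Where exactly the conversion happens in the original application is unknown, so the cast and the arithmetic below are only examples of operations that trigger it.

```php
<?php
$code = '55E77583';

var_dump($code);         // string(8) "55E77583"
var_dump((float) $code); // float(INF): read as 55 × 10^77583, beyond the float range
var_dump($code + 0);     // float(INF): arithmetic triggers the same conversion

// Keeping the value a string from end to end preserves it.
printf("%s\n", $code);   // 55E77583
```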

Shortest possible query string for a numerically indexed array in PHP

I’m looking for the most concise URL rather than the shortest PHP code. I don’t want my users to be scared by the hideous URLs that PHP creates when encoding arrays.
PHP produces a lot of repetition in the query string if you just pass an array ($fs) through http_build_query:
$fs = array(5, 12, 99);
$url = "http://$_SERVER[HTTP_HOST]/?" .
http_build_query(array('c' => 'asdf', 'fs' => $fs));
The resulting $url is
http://example.com/?c=asdf&fs[0]=5&fs[1]=12&fs[2]=99
How do I get it down to a minimum (using PHP or methods easily implemented in PHP)?

Default PHP way
What http_build_query does is a common way to serialize arrays to URL. PHP automatically deserializes it in $_GET.
When wanting to serialize just a (non-associative) array of integers, you have other options.
Small arrays
For small arrays, conversion to underscore-separated list is quite convenient and efficient. It is done by $fs = implode('_', $fs). Then your URL would look like this:
http://example.com/?c=asdf&fs=5_12_99
The downside is that you’ll have to explicitly explode('_', $_GET['fs']) to get the values back as an array.
Other delimiters may be used too. Underscore is considered alphanumeric and as such rarely has special meaning. In URLs, it is usually used as a space replacement (e.g. by MediaWiki). It is hard to distinguish when used in underlined text. Hyphen is another common replacement for space. It is also often used as a minus sign. Comma is a typical list separator, but unlike underscore and hyphen it is percent-encoded by http_build_query and has special meaning almost everywhere. The situation is similar with the vertical bar (“pipe”).
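The difference between the delimiters is easy to check. This small sketch passes the same list through http_build_query with each of the four delimiters mentioned above:

```php
<?php
$fs = [5, 12, 99];

foreach (['_', '-', ',', '|'] as $delimiter) {
    echo http_build_query(['fs' => implode($delimiter, $fs)]), "\n";
}
// fs=5_12_99
// fs=5-12-99
// fs=5%2C12%2C99
// fs=5%7C12%7C99
```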
Large arrays
When you have large arrays in URLs, you should first stop coding and start thinking. This almost always indicates bad design. Wouldn’t the POST HTTP method be more appropriate? Don’t you have a more readable and space-efficient way of identifying the addressed resource?
URLs should ideally be easy to understand and (at least partially) remember. Placing a large blob inside is really a bad idea.
Now I warned you. If you still need to embed a large array in URL, go ahead. Compress the data as much as you can, base64-encode them to convert the binary blob to text and url-encode the text to sanitize it for embedding in URL.
Modified base64
Or, better, use a modified version of base64. The variant I chose
uses - instead of +,
uses _ instead of /, and
omits the = padding.
define('URL_BASE64_FROM', '+/');
define('URL_BASE64_TO', '-_');
function url_base64_encode($data) {
$encoded = base64_encode($data);
if ($encoded === false) {
return false;
}
return str_replace('=', '', strtr($encoded, URL_BASE64_FROM, URL_BASE64_TO));
}
function url_base64_decode($data) {
$len = strlen($data);
if (is_null($len)) {
return false;
}
$padded = str_pad($data, $len + (4 - $len % 4) % 4, '=', STR_PAD_RIGHT); // str_pad() expects the total target length, not the number of pad characters
return base64_decode(strtr($padded, URL_BASE64_TO, URL_BASE64_FROM));
}
This saves two bytes on each character, that would be percent-encoded otherwise. There is no need to call urlencode function, too.
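To make the saving concrete, here is a self-contained sketch of the same idea. The names b64url_encode and b64url_decode are hypothetical, chosen so they do not clash with the functions above, and the three sample bytes were picked because they encode to nothing but + and / in standard base64.

```php
<?php
// Hypothetical names, chosen so they do not clash with the functions above.
function b64url_encode(string $data): string
{
    return rtrim(strtr(base64_encode($data), '+/', '-_'), '=');
}

function b64url_decode(string $data)
{
    // Restore the padding to a multiple of four characters before decoding.
    $padding = (4 - strlen($data) % 4) % 4;
    return base64_decode(strtr($data, '-_', '+/') . str_repeat('=', $padding), true);
}

$binary = "\xfb\xff\xfe";
echo base64_encode($binary), "\n";            // +//+
echo urlencode(base64_encode($binary)), "\n"; // %2B%2F%2F%2B
echo b64url_encode($binary), "\n";            // -__-
var_dump(b64url_decode(b64url_encode($binary)) === $binary); // bool(true)
```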
Compression
A choice has to be made between gzip (gzcompress) and bzip2 (bzcompress). I did not want to invest time in a thorough comparison, but gzip looked better on several relatively small inputs (around 100 chars) for any block size setting.
Packing
But what data should be fed into the compression algorithm?
In C, one would cast array of integers to array of chars (bytes) and hand it over to the compression function. That’s the most obvious way to do things. In PHP the most obvious way to do things is converting all the integers to their decimal representation as strings, then concatenation using delimiters, and only after that compression. What a waste of space!
So, let’s use the C approach! We’ll get rid of the delimiters and otherwise wasted space and encode each integer in 2 bytes using pack:
define('PACK_NUMS_FORMAT', 'n*');
function pack_nums($num_arr) {
array_unshift($num_arr, PACK_NUMS_FORMAT);
return call_user_func_array('pack', $num_arr);
}
function unpack_nums($packed_arr) {
return unpack(PACK_NUMS_FORMAT, $packed_arr);
}
Warning: pack and unpack behavior can be machine-dependent. With format codes such as s, S, i, I, l and L, the byte order (and for some of them the size) follows the machine, so links created on one system would break after a switch to a system with different endianness, and integrating multiple systems could cause problems. The n code used here is documented as always 16-bit big-endian, so it does not have this problem.
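Two details of the n* format are easy to miss, so here is a short sketch of them: unpack returns an array whose keys start at 1, and each value has to fit into 16 bits. The sample numbers are the ones from the question plus one made-up value that is too large.

```php
<?php
$packed = pack('n*', 5, 12, 99);
echo bin2hex($packed), "\n";   // 0005000c0063 (two bytes per number)

$unpacked = unpack('n*', $packed);
print_r($unpacked);            // keys are 1, 2, 3 rather than 0, 1, 2
$fs = array_values($unpacked); // re-index from 0: [5, 12, 99]

// Values outside 0..65535 do not fit into 16 bits and wrap around.
print_r(unpack('n', pack('n', 70000))); // [1 => 4464]
```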
Encoding together
Now packing, compression and modified base64, all in one:
function url_embed_array($arr) {
return url_base64_encode(gzcompress(pack_nums($arr)));
}
function url_parse_array($data) {
return unpack_nums(gzuncompress(url_base64_decode($data)));
}
See the result on IdeOne. It beats OP’s answer: on his 40-element array my solution produces 91 characters versus his 98. With range(1, 1000) (which generates array(1, 2, 3, …, 1000)) as a benchmark, OP’s solution produces 2712 characters while mine produces just 2032, which is about 25 % better.
For the sake of completeness, OP’s solution is
function url_embed_array($arr) {
return urlencode(base64_encode(gzcompress(implode(',', $arr))));
}
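A round trip through the packing, compression and modified base64 steps described in this answer might look like the sketch below. The function names embed_nums and extract_nums are hypothetical, the steps are inlined so the example runs on its own, and the zlib extension is assumed to be available. The exact output length can vary with the zlib version, so it is printed rather than stated.

```php
<?php
// Hypothetical names; same pipeline as above, inlined so the sketch runs on its own.
function embed_nums(array $nums): string
{
    $packed = pack('n*', ...$nums); // 2 bytes per integer (0..65535 only)
    return rtrim(strtr(base64_encode(gzcompress($packed)), '+/', '-_'), '=');
}

function extract_nums(string $data): array
{
    $padding = (4 - strlen($data) % 4) % 4;
    $binary  = base64_decode(strtr($data, '-_', '+/') . str_repeat('=', $padding));
    // unpack() returns 1-based keys, so re-index from 0.
    return array_values(unpack('n*', gzuncompress($binary)));
}

$fs      = range(1, 1000);
$encoded = embed_nums($fs);

var_dump(extract_nums($encoded) === $fs); // bool(true)
echo strlen($encoded), "\n";              // length of the URL-safe string
```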

There are multiple approaches possible:
serialize + base64 - can swallow any object, but data overhead is horrible.
implode + base64 - limited to arrays, forces user to find unused char as delimiter, data overhead is much smaller.
implode - unsafe for unescaped strings. Requires strict data control.
$foo = array('some unsafe data', '&&&==http://', '65535');
$ser = base64_encode(serialize($foo));
$imp = implode('|', $foo); // separator first; the reversed argument order is no longer accepted in PHP 8
$imp2 = base64_encode($imp);
echo "$ser\n$imp\n$imp2";
Results are as follows:
YTozOntpOjA7czoxNjoic29tZSB1bnNhZmUgZGF0YSI7aToxO3M6MTI6IiYmJj09aHR0cDovLyI7aToyO3M6NToiNjU1MzUiO30=
some unsafe data|&&&==http://|65535
c29tZSB1bnNhZmUgZGF0YXwmJiY9PWh0dHA6Ly98NjU1MzU=
While the serialize+base64 results are horribly long, implode+base64 gives output of manageable length that is safe for GET… except for that = at the end.

I believe the answer depends on the size of the query string.
Short query strings
For shorter query strings, this may be the best way:
$fs = array(5, 12, 99);
$fs_no_array = implode(',', $fs);
$url = "http://$_SERVER[HTTP_HOST]/?" .
http_build_query(array('c' => 'asdf', 's' => 'jkl')) . '&fs=' . $fs_no_array;
resulting in
http://example.com/?c=asdf&s=jkl&fs=5,12,99
On the other end you do this to get your array back:
$fs = array_map('intval', explode(',', $_GET['fs']));
Quick note about delimiters: A valid reason to avoid commas is that they are used as delimiters in so many other applications. On the off-chance you may want to parse your URLs in Excel, for example, the commas might make it slightly more difficult. Underscores would also work, but can blend in with the underlining that is standard in web formatting for links. So dashes may actually be a better choice than either commas or underscores.
Long query strings
I came across another possible solution:
$fs_compressed = urlencode(base64_encode(gzcompress($fs_no_array)));
On the other end it can be decompressed by
$fs_decompressed = gzuncompress(base64_decode($_GET['fs']));
$fs = array_map('intval', explode(',', $fs_decompressed));
assuming it’s passed in through GET variable.
Efficiency tests
31 elements
$fs = array(7,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,52,53,54,61);
Result:
eJwFwckBwCAQxLCG%2FMh4D6D%2FxiIdpGiG5fLIR0IkRZoMWXLIJQ8%2FDIqFjYOLBy8jU0yz%2BQGlbxAB
$fs_no_array is 84 characters long, $fs_compressed 84 characters long. The same!
40 elements
$fs = array(7,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,52,53,54,61);
Result:
eJwNzEkBwDAQAzFC84jtPRL%2BxFoB0GJC0QyXhw4SMgoq1GjQoosePljYOLhw48GLL37kEJE%2FDCnSZMjSpkMXow%2BdIBUs
$fs_no_array is 111 characters long, $fs_compressed 98 characters long.
Summary
The savings is only about 10 %. But at greater lengths the savings will increase to beyond 50 %.
If you use Yahoo sites, you notice things like comma separated lists as well as sometimes a series of random looking characters. They may be employing these solutions in the wild already.
Also check out this stack question, which talks in way too much detail about what is allowed in a URI.

PHP Regex string returns two identical arrays

I've got a regex here to pull out all of the <tr> tags in a page. It looks like this:
preg_match_all('%<tr[^>]++>(.*?)</tr>%s', $pageText, $rows);
The problem is that while it does find all of the tags on the page, the return value is actually a multidimensional array, where each entry of the outer array contains an array of all of the matches. In other words, it hands me multiple near-identical copies of the first array, i.e. the one I actually want.
Help please?
EDIT: Also relevant: I'm not allowed to use DOM for this application despite it being a significantly easier (and better) way of going about things.

What you're actually asking about is the $rows[0] list, which redundantly contains the <tr>...</tr> blob again. If you just care about the (.*?) inner data, then use \K to reset the full match.
preg_match_all('=<tr\b[^>]*+>(.*?)</tr>\K=s', $pageText, $rows);
It's not possible to get rid of $rows[0] completely. You'll have to ignore it, and use $rows[1] alone.
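To see what the two sub-arrays contain, here is a small sketch with a made-up two-row table. It uses the pattern without the trailing \K, so that $rows[0] still shows the full matches.

```php
<?php
$pageText = '<table><tr class="a"><td>1</td></tr><tr><td>2</td></tr></table>';

preg_match_all('~<tr\b[^>]*+>(.*?)</tr>~s', $pageText, $rows);

print_r($rows[0]); // full matches: '<tr class="a"><td>1</td></tr>', '<tr><td>2</td></tr>'
print_r($rows[1]); // first capture group: '<td>1</td>', '<td>2</td>'

// PREG_SET_ORDER groups the results per match instead of per capture group.
preg_match_all('~<tr\b[^>]*+>(.*?)</tr>~s', $pageText, $sets, PREG_SET_ORDER);
echo $sets[1][1], "\n"; // <td>2</td>
```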

Try this one:
preg_match_all('~<tr(?:\\s+[^>]*)?>(.*?)</tr>~si', $pageText, $rows);
var_dump($rows[1]);
Don't use % to wrap RegExps. It's a character somehow reserved for printf() like functions and with %s or %i at the end of your Pattern, it can be quite confusing.

php convert a hexadecimal number 273ef9 into a path 27/3e/f9

As the title says, what is an efficient way to convert a hexadecimal number such as 273ef9 into a path such as 27/3e/f9 in PHP?
Update:
Actually, I want to convert an ordinary number to hexadecimal and then convert that to a path... but maybe we can skip the middle step.

How about combining a str_split with implode? Might not be super efficient but very readable:
implode('/',str_split("273ef9",2));
As a side note, this will of course work well with larger hex strings and can handle partial (3,5,7 in length) hex numbers (by just printing it as a single letter after the last slash).
Edit: With what you're asking now (decimal -> hex -> path), it would look like this:
$num = 2572025;
$hex = dechex($num);
implode('/',str_split($hex,2));
Of course, you can combine it for an even shorter but less readable representation:
implode('/',str_split(dechex($num),2));

The most efficient approach is to touch each character in the hex value exactly once, building up the string as you go. Because the string may have either an odd or even number of digits, you'll have to start with a check for this, outputting a single digit if it's an odd-length string. Then use a for loop to append groups of two digits, being careful with whether or not to add a slash. It will be a few lines of code.
Unless this code is being executed many millions of times, it probably isn't worth writing out this algorithm; Michael Petrov's is so readable and so nice. Go with this unless you have a real need to optimize.
By the way, to go from a decimal number to a hex string, just use dechex :)
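The single-pass approach described in this answer could be sketched as follows. The function name hexToPath is hypothetical. Note that, as described, an odd-length string gets its single digit at the front, whereas the str_split version puts it at the end.

```php
<?php
// Hypothetical helper: an odd-length string yields a single leading digit.
function hexToPath(string $hex): string
{
    $length = strlen($hex);
    $start  = $length % 2;             // 1 if the length is odd, 0 otherwise
    $path   = substr($hex, 0, $start); // leading single digit, or ''

    // Append the remaining digits two at a time, separated by slashes.
    for ($i = $start; $i < $length; $i += 2) {
        if ($path !== '') {
            $path .= '/';
        }
        $path .= $hex[$i] . $hex[$i + 1];
    }
    return $path;
}

echo hexToPath('273ef9'), "\n";        // 27/3e/f9
echo hexToPath('273ef'), "\n";         // 2/73/ef
echo hexToPath(dechex(2572025)), "\n"; // 27/3e/f9
```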

php - Is strpos the fastest way to search for a string in a large body of text?

if (strpos(htmlentities($storage->getMessage($i)),'chocolate'))
Hi, I'm using gmail oauth access to find specific text strings in email addresses. Is there a way to find text instances quicker and more efficiently than using strpos in the above code? Should I be using a hash technique?

According to the PHP manual, yes- strpos() is the quickest way to determine if one string contains another.
Note:
If you only want to determine if a particular needle occurs within haystack,
use the faster and less memory intensive function strpos() instead.
This is quoted time and again in any php.net article about other string comparators (I pulled this one from strstr())
Although there are two changes that should be made to your statement.
if (strpos($storage->getMessage($i),'chocolate') !== FALSE)
This is because if (0) evaluates to false (and therefore doesn't run), yet strpos() returns 0 when the needle is at the very beginning (position 0) of the haystack. Also, removing htmlentities() will make your code run a lot faster. All that htmlentities() does is replace certain characters with their HTML entity equivalents. For instance, it replaces every & with &amp;.
As you can imagine, checking every character in a string individually and replacing many of them takes extra memory and processor power. Not only that, but it's unnecessary if you plan on just doing a text comparison. For instance, compare the following statements:
strpos('Billy & Sally', '&'); // 6
strpos('Billy &amp; Sally', '&'); // 6
strpos('Billy & Sally', 'S'); // 8
strpos('Billy &amp; Sally', 'S'); // 12
Or, in the worst case, you may even cause something true to evaluate to false.
strpos('<img src...', '<'); // 0
strpos('&lt;img src...', '<'); // FALSE
In order to circumvent this you'd end up using even more HTML entities.
strpos('&lt;img src...', '&lt;'); // 0
But this, as you can imagine, is not only annoying to code but also redundant. You're better off leaving HTML entities out entirely. HTML entities are usually only needed when you're outputting text, not when comparing it.
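The position-0 pitfall is easy to reproduce. A minimal sketch with a made-up message string; str_contains() is not mentioned above, but it exists in PHP 8.0 and later and avoids the problem altogether.

```php
<?php
$message = 'chocolate cake recipe';

var_dump(strpos($message, 'chocolate'));           // int(0): found at the very start
var_dump((bool) strpos($message, 'chocolate'));    // bool(false): a plain if () would miss it
var_dump(strpos($message, 'chocolate') !== false); // bool(true)
var_dump(strpos($message, 'vanilla'));             // bool(false): not found

// PHP 8.0 and later also offer str_contains(), which returns a bool directly.
if (function_exists('str_contains')) {
    var_dump(str_contains($message, 'chocolate')); // bool(true)
}
```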

strpos is likely to be faster than preg_match and the alternatives in this case. The best idea would be to run some benchmarks of your own with real example data and see what works best for your needs, although that may be overdoing it. Don't worry too much about performance until it starts to become a problem.

strpos() returns the starting position of the first occurrence of the string, or false (not null) if there is no match, so the check has to compare against false explicitly:
if (strpos($storage->getMessage($i), 'chocolate') !== false) {}
