Can PHP's strlen be trusted for mission-critical banking applications? - php

I am checking the length of a string with PHP's strlen function, and it says the length is 32. However, when I open the file in Notepad or Notepad++ I can see some empty spaces (the line is misaligned with the other lines).
Since I am building a banking application, it is critical that strlen be 100% correct.
Should I keep trusting this output, or should I move to something else like mb_strlen?
What is the most trustworthy function for the plain ASCII character set?

Whether you want to use strlen() or mb_strlen() depends on your server's setup: whether it has the mbstring extension, and whether function overloading is enabled (in which case strlen() is transparently replaced by mb_strlen() anyway). Apart from that, the computed values are 100% correct.
The difference you might see is in how text (especially multibyte strings) is visualized when you open it in a text-displaying application. It is much more likely that the application is giving you misleading information there.
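To illustrate (a minimal sketch; the literal strings are my own examples, assuming a UTF-8 source file):

```php
<?php
// strlen() counts raw bytes, so trailing spaces and other invisible
// characters are included in the count:
var_dump(strlen("ABC   "));             // int(6) -- three trailing spaces

// For plain ASCII, strlen() and mb_strlen() agree:
var_dump(mb_strlen("ABC   ", 'ASCII')); // int(6)

// They only diverge on multibyte text:
var_dump(strlen("héllo"));              // int(6) -- "é" is 2 bytes in UTF-8
var_dump(mb_strlen("héllo", 'UTF-8'));  // int(5) -- 5 characters
```

So if strlen() says 32, there really are 32 bytes in the string; an editor showing "empty spaces" is rendering characters it cannot display (tabs, non-breaking spaces, etc.), not contradicting the count.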

Related

Overriding PHP's Default String Functions with mb_string functions

So I've posted several questions related to updating existing software written in PHP to support Unicode / UTF-8. One of the solutions is to override PHP's default string functions with PHP's mb_string functions. However, I see a lot of people talking about negative consequences, yet no one really elaborates on them. Can someone please explain what these negative consequences are?
Why is it "bad" to override PHP's default string functions with its mb_string functions? It's after all much simpler than replacing all those functions with their corresponding mb_ functions manually. So what am I missing? What are these negative consequences?
It's bad to override them because if another developer comes along and works on this code, the functions might do something they weren't expecting. It's always good to use the default functions as they were intended.
The mb_* family of functions is also heavier, since they perform Unicode checks even on simple ASCII strings. At scale this will slow your application down (perhaps not by much, but definitely by something).
I'll try to elaborate.
Overloading the standard string functions with mb_* will have dire consequences for anything reading and dealing with binary files, or binary data in general. If you overload the standard function, then suddenly strlen($binData) is bound to return the wrong length at some point.
Why?
Imagine the binary data contains a byte with a value in the range 0xC0-0xDF, 0xE0-0xEF, or 0xF0-0xF7. Those are UTF-8 start bytes, and the overloaded strlen will count each such multibyte sequence as a single character, rather than the 2, 3, or 4 bytes it actually occupies.
And the main problem is that mbstring.func_overload is global. It doesn't just affect your own script, but all scripts, and any frameworks or libraries they may use.
When asked "should I enable mbstring.func_overload?", the answer is always, and SHOULD always be, a resounding NO.
You are royally screwed if you use it, and you will spend countless hours hunting bugs. Bugs that may very well be unfixable.
Well, you CAN call mb_strlen($string, 'latin1') to get it to behave, but it still carries overhead. strlen takes advantage of the fact that PHP strings, like Java strings, know their own length; mb_strlen has to parse the string to count its characters.
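A small sketch of the byte/character discrepancy on binary data (the byte values are arbitrary examples):

```php
<?php
// "\xC3\xA9" happens to be the UTF-8 encoding of "é", and "\xFF" is
// not valid UTF-8 at all -- both can occur in arbitrary binary data.
$binData = "\x01\xC3\xA9\xFF";

var_dump(strlen($binData));              // int(4) -- raw byte count
var_dump(mb_strlen($binData, 'UTF-8'));  // fewer than 4: "\xC3\xA9" counts as one character
var_dump(mb_strlen($binData, '8bit'));   // int(4) -- '8bit' (or 'latin1') treats every byte as one character
```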

Updating PHP CMS Site to fully support unicode / utf8

Some years ago, I built a good custom PHP CMS Site, but I overlooked one important issue: unicode support. This was primarily due to the fact that at the time, the users were English-speaking and that was to remain the case for the foreseeable future. Another factor was PHP's poor unicode support to begin with.
Well, now the day of reckoning has come. I want there to be support for unicode, specifically UTF8, but I have one major obstacle: PHP's string functions. Correct me if I'm wrong, but even now, in the world of PHP 5.5, PHP's regular string functions (i.e. strlen, substr, str_replace, strpos, etc) do not fully support unicode. On the other hand, PHP's mb_string functions do support unicode, but I have read that they may be rather resource heavy (which makes sense since we would be dealing with multibyte characters as opposed to single byte characters).
So, the way I see it, there are three solutions:
1) Use multibyte string functions in all cases.
A. Try to override the standard string functions with their multibyte counterparts. Speaking of which, were I to do this, what is the best way to do so?
B. Painstakingly go through all my code and replace the standard string functions with their multibyte function counterparts.
2) Painstakingly go through all my code and replace standard string functions that would work with user input, database data, etc with their multibyte function counterparts. This would require me to look at every usage of every string function carefully in my code to determine whether it has even the slightest chance of dealing with multibyte characters.
The benefit of this is that I would have the optimal running time while at the same time fully supporting unicode. The drawback here is that this would be very time-consuming (and extremely boring, I might add) to implement and there would always be the chance I'd miss using a multibyte string function where I should.
3) Overhaul my software entirely and start from scratch. But this is something I'm trying to avoid.
If there are other options available, please let me know.
I'd go for a variation of 1.B:
1.B.2) Use an automatic "Search and Replace" (a single carefully crafted sed command might do it).
Reason for 1 over 2: premature optimization is the root of all evil. I don't know where you read that the mb_ functions were "resource heavy", but plainly put, that's nonsense. Of course they take a few more CPU cycles, but that is a dimension you really should not worry about. For some reason PHP developers love to debate micro-optimizations like "are single quotes faster than double quotes?" when they should focus on the things that really make a difference (mostly I/O and the database). Really, it's not worth any effort.
Reason for automation: it's possible, it's more efficient, do you need more arguments?
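As a sketch of the automated route (assuming GNU sed, and that none of these function names appear inside string literals you care about; the temp directory stands in for a real source tree):

```shell
# Stand-in for your real source tree, so the command can be shown safely.
SRC=$(mktemp -d)
printf '<?php echo strlen($s), substr($s, 0, 2);' > "$SRC/example.php"

# Bulk-rewrite plain string calls to their mb_ counterparts.
# CAUTION: review the diff afterwards; this is a blunt instrument and
# will also rewrite matches inside comments and string literals.
for fn in strlen substr strpos strtolower strtoupper; do
    find "$SRC" -name '*.php' -print0 |
        xargs -0 sed -i "s/\b${fn}(/mb_${fn}(/g"
done

cat "$SRC/example.php"   # now calls mb_strlen() and mb_substr()
```

The `\b` word boundary keeps already-prefixed calls like mb_strlen() from being rewritten twice, since `_` is a word character.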

PHP utf-8 best practices and risks for distributed web applications

I have read several things about this topic but still I have doubts I want to share with the community.
I want to add complete UTF-8 support to the application I developed, DaDaBIK; the application can be used with different DBMSs (such as MySQL, PostgreSQL, SQLite). The charset used in the databases can be anything; I can't set or assume the charset.
My approach would be to convert, using the iconv functions, everything I read from the DB into UTF-8 and then convert it back to the original charset when I have to write to the DB. This would allow me to assume I'm always working with UTF-8.
The problem, as you probably know, is that PHP doesn't support UTF-8 natively and, even assuming the use of mbstring, there are (according to http://www.phpwact.org/php/i18n/utf-8) several PHP functions which can create problems with UTF-8 and don't have an mbstring correspondence, for example the PREG extension, strcspn, trim, ucfirst, ucwords....
Since I'm using some external libraries such as adodb and htmLawed, I can't control all the source code... in those libraries there are several usages of those functions... do you have any advice about this? And above all, how are very popular applications like WordPress handling this (IMHO big) problem? I doubt they don't have any "trim" in their code... do they just take the risk (of data corruption, for example), or is there something I can't see?
Thanks a lot.
First of all: PHP supports UTF-8 just fine natively. Only a few of the core functions dealing with strings should not be used on multi-byte strings.
It entirely depends on the functions you are talking about and what you're using them for. PHP strings are encoding-less byte arrays. Most standard functions therefore just work on raw bytes. trim just looks for certain bytes at the start and end of the string and trims them off, which works perfectly fine with UTF-8 encoded strings, because UTF-8 is entirely ASCII compatible. The same goes for str_replace and similar functions that look for characters (bytes) inside strings and replace or remove them.
The only real issue is functions that work with an offset, like substr. The default functions work with byte offsets, whereas you really want a more intelligent character offset, which does not necessarily correspond to bytes. For those functions an mb_ equivalent typically exists.
preg_ supports UTF-8 just fine using the /u modifier.
If you have a library which uses, for instance, substr on a potential multi-byte string, use a different library because it's a bad library.
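A short sketch of the offset problem and the /u modifier (the example strings are my own):

```php
<?php
$s = "héllo";   // 5 characters, 6 bytes in UTF-8

// Byte-based substr() can cut the 2-byte "é" in half:
var_dump(substr($s, 0, 2));               // "h" plus half of "é" -- mojibake

// Character-based mb_substr() respects character boundaries:
var_dump(mb_substr($s, 0, 2, 'UTF-8'));   // string(3) "hé"

// With /u, a preg dot matches one character, not one byte:
var_dump(preg_match('/^h.llo$/u', $s));   // int(1) -- matches
var_dump(preg_match('/^h.llo$/',  $s));   // int(0) -- "é" is two bytes here
```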
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for some more in-depth discussion and demystification about PHP and character sets.
Further, it does not matter what the strings are encoded as in the database. You can set the connection encoding for the database, which will cause it to convert everything for you and always return you data in the desired client encoding. No need for iconverting everything in PHP.
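For example (a sketch assuming MySQL; the hostname, database name, and credentials are placeholders):

```php
<?php
// Ask the server to convert everything to/from UTF-8 on the wire,
// regardless of how the individual columns are stored:
$pdo = new PDO(
    'mysql:host=localhost;dbname=app;charset=utf8mb4',
    'user',
    'secret'
);

// The mysqli equivalent:
// $mysqli = new mysqli('localhost', 'user', 'secret', 'app');
// $mysqli->set_charset('utf8mb4');
```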

Is mb_strlen a suitable replacement for iconv_strlen

That is, if I'm coding something entirely in PHP4? Or perhaps I should use a custom function or class/extension instead to count the number of characters in a multibyte string?
The only difference I can spot is that mb_strlen strips out bad sequences, while iconv_strlen doesn't.
If you want a drop-in replacement for plain strlen, use mb_strlen as it always returns an int. This is very debatable though (iconv's correctness over mb's tolerance), but in practice mb_strlen's fault tolerance served me better. Just make sure you configure mb to the desired encoding either in php.ini or in a central place in your application.
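A sketch of that central configuration, plus the tolerance difference mentioned above (the broken string is a contrived example):

```php
<?php
// Central bootstrap: make every mb_* call default to UTF-8.
mb_internal_encoding('UTF-8');

$good = "héllo";
$bad  = "h\xC3llo";   // "\xC3" starts a multibyte sequence that never finishes

// On valid input the two agree:
var_dump(mb_strlen($good));                // int(5)
var_dump(iconv_strlen($good, 'UTF-8'));    // int(5)

// On invalid input they differ: iconv_strlen() fails,
// while mb_strlen() tolerates the bad byte and still returns an int.
var_dump(@iconv_strlen($bad, 'UTF-8'));    // bool(false)
var_dump(mb_strlen($bad));                 // still an int
```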
Unicode support in PHP is in a bad place, you have to be aware of many pitfalls and exceptions. Having done a complete switch of several large applications and their user data to UTF-8, I could cry you a river.

Removing characters from a PHP String

I'm accepting a string from a feed for display on the screen that may or may not include some rubbish I want to filter out. I don't want to filter normal symbols at all.
The values I want to remove look like this: �
It is only this that I want removed. Relevant technology is PHP.
Suggestions appreciated.
This is an encoding problem; you shouldn't try to clean those bogus characters, but rather understand why you're receiving them scrambled.
Try to get your data as Unicode, or make an agreement with your feed provider so that you both use the same encoding.
Thanks for the responses, guys. Unfortunately, those submitted had the following problems:
wrong for obvious reasons:
ereg_replace("[^A-Za-z0-9]", "", $string);
This:
s/[\u00FF-\uFFFF]//
which uses the deprecated ereg form of regex, also didn't work when I converted it to preg, because the range was simply too large for the regex to handle. Also, there are holes in that range that would allow rubbish to seep through.
This suggestion:
This is an encoding problem; you shouldn't try to clean that bogus characters but understand why you're receiving them scrambled.
while valid, is no good because I don't have any control over how the data I receive is encoded. It comes from an external source. Sometimes there's garbage in there and sometimes there is not.
So, the solution I came up with was relatively dirty, but in the absence of something more robust I'm just accepting all standard letters, numbers and symbols and discarding the rest.
This does seem to work for now. The solution is as follows:
$fixT = str_replace("Â£", "£", $string);
$fixT = str_replace("â‚¬", "€", $fixT);
$fixT = preg_replace("/[^a-zA-Z0-9\s\.\/:!\[\]\*\+\-\|\<\>@#\$%\^&\(\)_=\';,'\?\\\{\}`~\"]/", "", $fixT);
If anyone has any better ideas I'm still keen to hear them. Cheers.
You are looking for characters that are outside of the range of glyphs that your font can display. You can find the maximum unicode value that your font can display, and then create a regex that will replace anything above that value with an empty string. An example would be
s/[\u00FF-\uFFFF]//
This would strip anything above character 255.
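In PHP, that idea would look roughly like this (a sketch; `$input` is a placeholder, and you may want a different cut-off):

```php
<?php
// Strip every character above U+00FF; /u makes the character class
// operate on UTF-8 characters rather than single bytes.
$clean = preg_replace('/[\x{0100}-\x{10FFFF}]/u', '', $input);
```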
That's going to be difficult for you to do, since you don't have a solid definition of what to filter and what to keep. Typically, characters that show up as empty squares are anything that the typeface you're using doesn't have a glyph for, so the definition of "stuff that shows up like this: �" is horribly inexact.
It would be much better for you to decide exactly what characters are valid (this is always a good approach anyway, with any kind of data cleanup) and discard everything that is not one of those. The PHP filter function is one possibility to do this, depending on the level of complexity and robustness you require.
If you can't resolve the issue with the data from the feed and need to filter the information, then this may help:
PHP 5's filter_input is very good for filtering input strings and allows a fair amount of flexibility
filter_input(input_type, variable, filter, options)
You can also filter all of your form data in one line if it requires the same filtering :)
There are some good examples and more information about it here:
http://www.w3schools.com/PHP/func_filter_input.asp
The PHP site has more information on the options here: Validation Filters
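A minimal sketch (the parameter name `title` and the chosen filter are just examples):

```php
<?php
// Read $_GET['title'] and encode special characters on the way in;
// returns null if the parameter is absent.
$title = filter_input(INPUT_GET, 'title', FILTER_SANITIZE_SPECIAL_CHARS);
var_dump($title);
```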
Take a look at this question to get the value of each byte in your string. (This assumes that multibyte overloading is turned off.)
Once you have the bytes, you can use them to determine what these "rubbish" characters actually are. It's possible that they're a result of misinterpreting the encoding of the string, or displaying it in the wrong font, or something else. Post them here and people can help you further.
Try this:
Download a sample from the feed manually.
Open it in Notepad++ or another advanced text editor (KATE on Linux is good for this).
Try changing the encoding and converting from one encoding to another.
If you find a setting that makes the characters display properly, then you'll need to either encode your site in that encoding, or convert it from that encoding to whatever you use on your site.
Try this regular expression to remove literal \uXXXX escape sequences from a string:
/\\u[0-9a-fA-F]{4}/
