PHP String Length Without strlen() - php

Just browsing over the latest release of the PHP coding standards, and something caught my eye:
http://svn.php.net/viewvc/php/php-src/trunk/CODING_STANDARDS?revision=296679&view=markup
Coding standard #4 states that "When writing functions that deal with strings, be sure to remember that PHP holds the length property of each string, and that it shouldn't be calculated with strlen()..."
I've ALWAYS used strlen, and maybe it's just late, but how do you access the built-in length property of a string in PHP?

They're talking about the C function, not the PHP function. The C function will stop counting after the first \0, but PHP strings can contain \0 elsewhere other than the end.
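The same point can be checked from PHP userland. This is a minimal sketch (the variable names are only illustrative): it builds a string with an embedded NUL byte and shows that PHP's own strlen() reports the stored length instead of stopping at the first \0.

```php
<?php
// A PHP string with a NUL byte in the middle: 3 letters, "\0", 3 letters.
$s = "abc\0def";

// PHP's strlen() reads the stored length, so the NUL byte is counted.
$length = strlen($s);
var_dump($length); // int(7)

// The bytes after the NUL are still part of the string.
$tail = substr($s, 4);
var_dump($tail); // string(3) "def"
```

C's strlen() called on the same buffer would report 3, which is why the coding standard tells extension authors to use the stored length.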

It's clearly stated that they are talking about the PHP coding standards, not about C functions or extensions to the PHP engine.
======================== PHP Coding Standards========================
This file lists several standards that any programmer, adding or changing
code in PHP, should follow. Since this file was added at a very late
stage of the development of PHP v3.0, the code base does not (yet) fully
follow it, but it's going in that general direction. Since we are now
well into the version 4 releases, many sections have been recoded to use
these rules.
Still, I didn't find any relevant information about a string length property, but I think they might release that information in the future if it's related to a new version of PHP.
Please post if anyone finds useful information about this.

To get the length of a string zval (variable in C) use Z_STRLEN(your_zval)
see zend_operators.h at line 398 (PHP 5.4) :
#define Z_STRLEN(zval) (zval).value.str.len

Related

What is the difference between a string and a "binary" string in PHP?

In PHP, you can (since PHP 5.2.1) use "binary strings":
$binary = (binary) $string;
$binary = b"binary string";
What is the difference with a "normal" string?
The only meaningful insight I could find was this comment:
However, it will only have effect as of PHP 6.0.0, as noted on http://www.php.net/manual/en/function.is-binary.php .
The link is dead. It would actually make sense that binary strings were added in PHP while PHP 6.0 was being developed, since 6.0 was supposed to bring Unicode support. So it was sort of a premature feature.
However is there an official source that could confirm that? I.e. confirm that there is absolutely no difference between classic strings and binary strings?
I don't have any official source to back this up, but I believe the reason is simple:
PHP currently treats strings as byte arrays, i.e. as raw binary blobs. PHP 6 was slated to be this great new release with its biggest improvement being native Unicode handling. At that point, a string literal would actually be understood as a string of characters instead of as a string of bytes. Many string handling functions would break because of this and a lot of code would need to be retrofitted to continue to work in PHP 6.
As a migration path, strings could be declared as binary strings to keep the current behaviour. This was added early on to give developers ample time to prepare their code for PHP 6 compatibility. At the moment b doesn't do anything; the changed behaviour would only show up in PHP 6.
Well, PHP 6 never happened and is dead for now. So, b continues to do nothing for the time being and for now it's questionable if it will ever have any specific use in the future.
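To illustrate the "b does nothing" claim, here is a small sketch (variable names are illustrative) comparing a plain literal, a b-prefixed literal and a (binary) cast. Newer PHP releases may emit a deprecation notice for the (binary) cast spelling, so treat this as a demonstration rather than a recommendation.

```php
<?php
// "café" written with explicit UTF-8 bytes so the file encoding does not matter.
$plain    = "caf\xC3\xA9";
$prefixed = b"caf\xC3\xA9";   // b prefix, accepted since PHP 5.2.1
$cast     = (binary) $plain;  // (binary) cast, behaves like (string)

// All three hold the same bytes and the same type.
$samePrefixed = ($plain === $prefixed);
$sameCast     = ($plain === $cast);
$type         = gettype($cast);

var_dump($samePrefixed, $sameCast, $type);
```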

Change the comment character in Boost Program Options?

I have an app that has components in both PHP and C++. They need to share some configuration options, and I'd like to use one file to share these -- a simple config file.
Fortunately, PHP has parse_ini_file() and Boost has Program Options, and they share virtually identical semantics. Both can read all the options I need.
The one "gotcha" here is that PHP's function supports semicolon (";") as the comment character, and Boost supports hash ("#"). PHP used to support hash, but now it throws a deprecated error on it.
I'm pretty sure I can't easily change the comment character in PHP. Anyone know if I can change the Boost comment character? I'd love to not have to rewrite all this functionality just for comments.
Figured out a solution to this problem.
Boost is reasonably robust, but I couldn't see a reasonable way to replace its comment character, and # is a fairly well accepted comment character in config files, so I solved it on the PHP side.
I load the config file using file_get_contents, use a preg_replace to remove the lines that begin with #, then pass the result through parse_ini_string.
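A minimal sketch of that approach follows. The helper name parse_shared_ini_string and the sample keys are hypothetical, and the config text is inlined so the example is self-contained; in the real setup the string would come from file_get_contents() on the shared file.

```php
<?php
// Hypothetical helper: strip "#" comment lines, then hand the rest to parse_ini_string().
function parse_shared_ini_string(string $raw): array
{
    // Blank out whole lines whose first non-blank character is "#".
    $clean = preg_replace('/^[ \t]*#.*$/m', '', $raw);
    if ($clean === null) {
        throw new RuntimeException('Could not strip comment lines.');
    }

    $parsed = parse_ini_string($clean);
    if ($parsed === false) {
        throw new RuntimeException('Could not parse the configuration.');
    }

    return $parsed;
}

// Stand-in for file_get_contents('<path to the shared config file>').
$raw = "# shared by the PHP and C++ components\nhost = localhost\nport = 8080\n";
$config = parse_shared_ini_string($raw);
```

Only full-line comments are removed here; a # that appears inside a value is left alone.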

PHP string functions vs mbstring functions

I have an application that has so far been in English only. Content encoding throughout templates and database has been UTF-8. I am now looking to internationalize/translate the application into languages that have character sets absolutely needing UTF-8.
The application uses various PHP string functions such as strlen(), strpos(), substr(), etc., and my understanding is that I should switch these for multi-byte string functions such as mb_strlen(), mb_strpos(), mb_substr(), etc., in order for multi-byte characters to be handled correctly. I've tried to read around this topic a little, but virtually everything I can find goes deep into "encoding theory" and doesn't provide a simple answer to the question: if I'm using UTF-8 throughout, can I switch from using strlen() to mb_strlen() and expect things to work normally in, for example, both English and Arabic, or is there something else I still need to look out for?
Any insight would be welcome, and apologies if I'm offending someone who has encoding close to their heart with my relative ignorance.
No. Since byte arrays are also strings in PHP, a blanket replacement of the 8-bit string functions with their mb_* counterparts will cause nothing but trouble. Functions like strlen() and substr() are probably used more frequently with bytes than with actual text strings.
At the place I last worked, we managed to build a multilingual web-site (Arabic, Hindi, among other languages) in PHP without using the mbstring library at all. Text string manipulation actually doesn't happen that often. When it does, it would require far more care than just changing a function name. Most of the challenges, I've found, lie on the HTML side. Getting a page layout to work with a RTL language is the non-trivial part.
I don't know if you're just using Arabic as an example. The difficulty of internationalization can vary quite substantially depending on whether "international" means European languages only (plus Russian), or if it's inclusive of Middle-Eastern, South-Asian, and Far-East languages.
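To make the byte-versus-character distinction concrete, here is a small sketch. It assumes the mbstring extension is loaded, and the sample word is arbitrary.

```php
<?php
// "café" as UTF-8 bytes: the "é" takes two bytes.
$word = "caf\xC3\xA9";

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters

// Cutting by bytes can split a multi-byte character in half...
$byteCut = substr($word, 0, 4);
// ...while cutting by characters keeps it intact.
$charCut = mb_substr($word, 0, 4, 'UTF-8');

var_dump($bytes, $chars);
var_dump(mb_check_encoding($byteCut, 'UTF-8'), mb_check_encoding($charCut, 'UTF-8'));
```

Which result is correct depends on the call site: a byte count is what a Content-Length header or a binary protocol needs, while a character count is what a "max 20 characters" text rule needs.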
Check the status of the mbstring.func_overload flag in php.ini
If (ini_get('mbstring.func_overload') & 2) is non-zero, then functions like strlen() are already overloaded by their mb_* counterparts such as mb_strlen(), so there is no need for you to call the mb_* functions explicitly.
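The check can be sketched as follows. Note that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, so on current versions ini_get() returns false for it and the flag below is always false; the variable names are illustrative.

```php
<?php
// ini_get() returns a string such as "0" or "2", or false if the setting does not exist.
$overload = (int) ini_get('mbstring.func_overload');

// Bit 2 covers the string functions (strlen(), strpos(), substr(), ...).
$stringFunctionsOverloaded = ($overload & 2) === 2;

var_dump($stringFunctionsOverloaded);
```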
Fewer than 10 multibyte functions are really needed, so create 3 to 5 questions asking whether a particular use of a function or piece of logic is sound. This question is obscure and hard to answer. Small questions get quick answers, and concrete questions bring out good answers. Let me know when you create other questions.
If you need use cases, see the fallback functions in CMSes such as WordPress, MediaWiki and Drupal.
When you decide to start using mbstring, you should avoid the mbstring.func_overload directive. The mbstring maintainers are going to deprecate mbstring.func_overload in PHP 5.5 or 5.6 (see the PHP core mailing list, April 2012). mbstring.func_overload breaks codebases that do not expect it to be enabled. You can see such cases in CakePHP and Zend Framework 1.x, which calculate Content-Length using strlen().
I answered a similar question in another place: Should i refactor all my framework to use mbstring functions?

Ruby equivalent to PHP's utf8_encode and utf8_decode functions [duplicate]

Possible Duplicate:
How can I convert a string from windows-1252 to utf-8 in Ruby?
How can i transform the utf8 chars to iso8859-1
So here's my problem. I have tools I've distributed across my internal LAN and my external web servers serve up jobs/data to the internal LAN. As a result I'm often passing data I'd rather snoopy people didn't see. In other words, it's not the end of the world if someone sees my SQL string but I'd rather they didn't. I know, it's like a deadbolt on a sliding glass door. It won't keep anyone truly determined out but it should discourage the random curious script kiddie.
So I have a set of simple ciphers I've written in PHP. Recently I've determined that the next extension to my toolset needs to be in Ruby, but those new tools need to communicate with my previously built set of PHP tools, and I don't want to rebuild all of the PHP tools. So I need my PHP ciphers to be exactly reproduced by my Ruby code, so that when Ruby encrypts a string my PHP tools can decipher it, then pass back an encrypted string for my Ruby tools to decipher.
My simple ciphers are just modified Caesar ciphers. A Caesar cipher (for those unfamiliar with the name) is where you shift all characters by a single known number of letters, i.e. a shift of 3 turns A into D, B into E, C into F, etc. A true Caesar cipher would require wrapping so that a shift of 3 would turn a Z into a C. However, mine doesn't do that; it simply adds 3 to Z and uses the utf8_encode and utf8_decode functions in PHP.
Now I need an equivalent in Ruby. I thought I'd found it in
str.encode('utf-8')
But that returns this error
undefined method 'encode' for #
My Googling suggests there is no single solution to this in Ruby for some reason. Ruby needs to know the current encoding of the string before encoding it into UTF-8. At least that's the way I understood the issue.
So the string coming in would be whatever Ruby 1.8.7 defaults to. (In case this is useful... I'm using Ubuntu 12.04 desktop, US, English and I think I grabbed Ruby using apt-get with the default repositories.) I need a variety of strings like SQL query statements - "SELECT * FROM table WHERE id = ?" and strings produced by PHP's md5() output. Every other string should fall into the category of upper case letters, lower case letters, and numbers all of it in US English.
Thanks
Check out Encoding and look at the documentation for the encode method you're using. I believe it is the first form that you need.

how to check if a php file is obfuscated?

Is there any way to check, using PHP, whether a PHP file has been obfuscated? I was thinking regex maybe (for instance, ionCube's encoded files contain a very long alphabetic string, etc.).
One idea is to check for whitespace. The first thing that an obfuscator will do is to remove extra whitespace. Another thing you can look for is the number of characters per line, as obfuscators will put all the code into few (one?) lines.
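The whitespace and line-length idea can be sketched like this. The helper name layout_stats and both sample snippets are hypothetical, and no thresholds are suggested, since what counts as "too dense" would have to be tuned against real files.

```php
<?php
// Hypothetical helper: crude layout statistics for a piece of source text.
function layout_stats(string $code): array
{
    $lines = preg_split('/\R/', $code);       // split on any newline style
    $length = strlen($code);
    $whitespace = preg_match_all('/\s/', $code);

    return [
        'lines' => count($lines),
        'max_line_length' => max(array_map('strlen', $lines)),
        'whitespace_ratio' => $length > 0 ? $whitespace / $length : 0.0,
    ];
}

$readable = <<<'PHP'
<?php
function add($left, $right)
{
    return $left + $right;
}
PHP;

$minified = '<?php function add($a,$b){return $a+$b;}';

$readableStats = layout_stats($readable);
$minifiedStats = layout_stats($minified);
```

Here the minified snippet ends up with a single long line and a lower whitespace ratio than the formatted one, which is the signal the answer describes.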
Often, obfuscators initialize very large arrays to translate variables into less meaningful names (e.g. see obfuscator article).
One technique may be to search for these super-large arrays, close to the top of the class/file, etc. You may be able to hook up Xdebug to look for these. The whole thing of course depends on the obfuscation technique used. Check the source code; there may be patterns they've used that you can search for.
I think you can use token_get_all() to parse the file, then compute some statistics. For example, check the number of function calls (in case the obfuscator uses an eval() string and nothing else) and calculate the average function name length: for obfuscators it will usually be about 3-5 chars, while for normal PHP code it should be much bigger. You can also use a dictionary lookup for function/variable names, check for comments, etc. I think that if you know all the obfuscator formats you want to detect, it will be easy.
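A rough sketch of the token-statistics idea, assuming the tokenizer extension is available. The helper name source_stats and the sample source are hypothetical, and the statistics chosen (identifier count, average identifier length, comment count) are just one possible reading of the suggestion.

```php
<?php
// Hypothetical helper: gathers crude statistics from PHP source text.
function source_stats(string $code): array
{
    $nameLengths = [];
    $comments = 0;

    foreach (token_get_all($code) as $token) {
        if (!is_array($token)) {
            continue; // single-character tokens such as ";" or "{"
        }
        [$id, $text] = $token;

        if ($id === T_COMMENT || $id === T_DOC_COMMENT) {
            $comments++;
        } elseif ($id === T_VARIABLE) {
            $nameLengths[] = strlen($text) - 1; // drop the leading "$"
        } elseif ($id === T_STRING) {
            $nameLengths[] = strlen($text);     // function, class and constant names
        }
    }

    $count = count($nameLengths);

    return [
        'identifiers' => $count,
        'avg_name_length' => $count > 0 ? array_sum($nameLengths) / $count : 0.0,
        'comments' => $comments,
    ];
}

$sample = '<?php /* total */ function add($left, $right) { return $left + $right; }';
$stats = source_stats($sample);
```

A file with very short average identifier length and no comments would be a candidate for closer inspection, not proof of obfuscation.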
