wide strings in php?

Originally, my problem was that two paths looked identical on visual inspection, yet file_exists() returned true for one and false for the other. After spending hours narrowing the problem down, I wound up with the following code (paths redacted):
$myCorePath = $modx->getOption('my.core_path', null, $modx->getOption('core_path').'components/my/');
$pkg1 = $myCorePath.'model/';
$pkg2 = MODX_CORE_PATH . 'components/my/model/';
$pkg3 = '/path/to/modx/core/components/my/model/';
var_dump($pkg1, $pkg2, $pkg3);
...and its output:
string '/path/to/modx/core/components/my/model/' (length=37)
string '/path/to/modx/core/components/my/model/' (length=78)
string '/path/to/modx/core/components/my/model/' (length=78)
So two versions, interestingly including the one where I simply write the string down, apparently use wide characters (these worked, file_exists()-wise), while sadly my preferred variant uses narrow characters. I tried to research this, but the only thing I found told me that PHP has no such thing as wide strings. I also verified with a hex editor that all string constants really take only one byte per character in the PHP file.
phpinfo() tells me I have PHP Version 5.4.9, and I run on a 64-bit Linux machine, fwiw. The manual was edited a week ago; is its info not accurate, or what's going on here?

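I think it is caused by a multibyte encoding: the length that var_dump() reports is a byte count, so a string containing multibyte characters can look identical on screen and still be longer.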
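One way to check the multibyte theory is to compare the raw bytes rather than the rendered text. The sketch below is only an illustration: the paths are invented, and the Cyrillic look-alike letter is an assumed cause, not something confirmed by the question. It uses only core functions (strlen, bin2hex).

```php
<?php
// Two strings that render almost identically but differ in their bytes.
// The second one contains a Cyrillic "о" (U+043E) instead of a Latin "o";
// this look-alike is a made-up example, not the asker's actual data.
$ascii     = '/path/to/model/';
$lookalike = "/path/t\u{043E}/model/";

// Hypothetical helper: show the byte length and a hex dump of a string.
function describeBytes(string $s): string
{
    return sprintf('%d bytes: %s', strlen($s), bin2hex($s));
}

echo describeBytes($ascii), "\n";
echo describeBytes($lookalike), "\n";
var_dump($ascii === $lookalike);
```

If the hex dumps of two "identical" paths differ, the position of the first differing byte usually points at the offending character.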

Related

Getting different output for same PHP code

(Can't paste the exact question as the contest is over and I am unable to access the question. Sorry.)
Hello, recently I took part in a programming contest (PHP). I tested the code on my PC and got the desired output, but when I checked my code on the contest website and on ideone, I got the wrong output. This is the second time the same thing has happened: same PHP code, but different output.
It takes its input from the command line. The goal is to count the substrings that contain only the characters 'A','B','C','a','b','c'.
For example: Consider the string 'AaBbCc' as CLI input.
Substrings: A,a,B,b,C,c,Aa,AaB,AaBb,AaBbC,AaBbCc,aB,aBb,aBbC,aBbCc,Bb,BbC,BbCc,bC,bCc,Cc.
Total substrings: 21 which is the correct output.
My machine:
Windows 7 64 Bit
PHP 5.3.13 (Wamp Server)
Following is the code:
<?php
$stdin = fopen('php://stdin', 'r');
while(true) {
    $t = fread($stdin,3);
    $t = trim($t);
    $t = (int)$t;
    while($t--) {
        $sLen=0;
        $subStringsNum=0;
        $searchString="";
        $searchString = fread($stdin,20);
        $sLen=strlen($searchString);
        $sLen=strlen(trim($searchString));
        for($i=0;$i<$sLen;$i++) {
            for($j=$i;$j<$sLen;$j++) {
                if(preg_match("/^[A-C]+$/i",substr($searchString,$i,$sLen-$j))) {$subStringsNum++;}
            }
        }
        echo $subStringsNum."\n";
    }
    die;
}
?>
Input:
2
AaBbCc
XxYyZz
Correct Output (My PC):
21
0
Ideone/Contest Website Output:
20
0

You have to keep in mind that your code is also processing the newline characters.
On Windows systems, a newline is composed of two characters, whose escaped representation is \r\n.
On UNIX systems, including Linux, only \n is used, and classic Mac OS (before OS X) used \r instead.
Since you are relying on standard input, your script is susceptible to those platform differences. Even if the input were a file, you would be enforcing the platform convention by opening the handle with the flag "r" instead of "rb", because "r" does not explicitly ask for binary-safe reading.
You can see in this Ideone.com version of your code how the PHP script there gives the expected output when you enforce the newline characters used by your home system, while in this other version using UNIX newlines it gives the "wrong" output.
I suppose you should use fgets() to read each string separately instead of fread(), and then trim() each one to remove those characters before processing.
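A minimal sketch of the fgets()-plus-trim() approach suggested above. The function names countAbcSubstrings(), solve() and streamFrom() are made up for illustration, and an in-memory stream stands in for php://stdin so that both newline styles can be tried without a terminal.

```php
<?php
// Count the substrings that consist only of the letters A-C (either case).
function countAbcSubstrings(string $s): int
{
    $n = strlen($s);
    $count = 0;
    for ($start = 0; $start < $n; $start++) {
        for ($len = 1; $len <= $n - $start; $len++) {
            if (preg_match('/^[A-C]+$/i', substr($s, $start, $len)) === 1) {
                $count++;
            }
        }
    }
    return $count;
}

// Read the number of test cases, then one string per line.
// fgets() stops at the end of each line; trim() drops "\r" and "\n".
function solve($stream): array
{
    $results = [];
    $cases = (int) trim((string) fgets($stream));
    while ($cases-- > 0 && ($line = fgets($stream)) !== false) {
        $results[] = countAbcSubstrings(trim($line));
    }
    return $results;
}

// Hypothetical helper: wrap a string in a readable stream.
function streamFrom(string $input)
{
    $stream = fopen('php://memory', 'r+');
    fwrite($stream, $input);
    rewind($stream);
    return $stream;
}

// The same input with Windows and with UNIX line endings.
$windows = solve(streamFrom("2\r\nAaBbCc\r\nXxYyZz\r\n"));
$unix    = solve(streamFrom("2\nAaBbCc\nXxYyZz\n"));

echo implode("\n", $windows), "\n";
echo implode("\n", $unix), "\n";
```

In an actual submission, solve(STDIN) would replace the in-memory streams. Because each line is read and trimmed on its own, the count no longer depends on how many bytes the newline occupies.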

I tried to analyse this code, and this is what I found:
There seem to be no problems with the input strings. If there were any, it would be impossible to get the result 20.
I don't see any problem with the loops. I usually use pre-increment, but that shouldn't affect the result at all.
For me there are only two possibilities that could cause the unexpected result:
One of the loop iterations isn't executed. It could only be the last pass of the inner loop (when $i == 5 and then $j == 5, because that loop runs just once), which would account for the difference between 21 and 20.
preg_match doesn't match the string in one of the occurrences (there are 21 preg_match checks, and one of them, possibly the last one, doesn't match).
If I had to choose, I would go for the first possible cause. If I were you, I would contact the contest's author and ask about the PHP version and the possibility of testing other code. The most important thing here is how many times preg_match() is called at all, 20 or 21 (a simple echo or an extra counter would tell us that), and which strings preg_match() checks. In my opinion, that is the only way to find out why this code doesn't work.
It would be nice if you could post any info here when you find out something more.
PS. Of course I also get the result 21, so it's hard to say what could be wrong.

UTF-8, PHP, Win7 - Is there a solution now to save UTF-8-filenames on Win 7 using php?

Update: So that you don't have to read through everything: PHP supports UTF-8 filenames on Windows starting with 7.1.0alpha2. (Thanks to Anatol-Belski!)
Following some link chains on stackoverflow I found part of the answer:
https://stackoverflow.com/a/10138133/3716796 by Umberto Salsi
(and on the same question: https://stackoverflow.com/a/2950046/3716796 by Artefacto)
In short: 'PHP communicate[s] with the underlying file system as a "non-Unicode aware program"', and because of that all filenames given to PHP by Windows and vice versa are automatically translated/reencoded by Windows. This causes the errors. And you seemingly can't stop the automatic reencoding.
(And https://stackoverflow.com/a/2888039/3716796 by Artefacto: "PHP does not use the wide WIN32 API calls, so you're limited by the codepage.")
And at https://bugs.php.net/bug.php?id=47096 there is the bug report for PHP.
Though nicolas suggests there that a COM object might work: $fs = new COM('Scripting.FileSystemObject', null, CP_UTF8);
Maybe I will try that sometime.
So this part of my question is left: Is PHP 6 out, or was it withdrawn, or is there anything new in PHP on this topic?
// full Question
Most questions about this topic are 1 to 5 years old.
Could php now save a file using
file_put_contents($dir . '/' . $_POST['fileName'], $_POST['content']);
when the $_POST['fileName'] is UTF-8 encoded, for example "Крым.xml" ?
Currently it is saved as
ÐšÑ€Ñ‹Ð¼.xml
I checked the fileName variable, so I can be sure it's UTF-8:
echo mb_detect_encoding($_POST['fileName']);
Is there now anything new in PHP that could accomplish it?
In some places I read that PHP 6 would be able to do it, but PHP 6, if I remember right, has been withdrawn?
In Windows Explorer I can change the name of a file to "Крым.xml". As far as I have understood the old questions and answers, it should be possible to use file_put_contents if the fileName variable is simply converted to the encoding used by Windows 7 and its NTFS disk.
There are even three old questions with answers that claim to have succeeded: PHP File Handling with UTF-8 Special Characters,
Convert UTF-16LE to UTF-8 in php,
and PHP: How to create unicode filenames.
Overall, the most upvoted answers say it is not possible.
I have already checked all the suggested answers myself, and none of them works.
How can I find out definitively, and with absolute accuracy, which encoding my Win 7 and Explorer use to save the filename on my NTFS disk with the German language setting?
As said: I can create a file "Крым.xml" in the Explorer.
My conclusion:
1. Either file_put_contents doesn't work correctly when handing the fileName over to Windows (which I tried with conversions to UTF-16, UTF-16LE, ISO-8859-1 and Windows-1252),
2. or file_put_contents just doesn't implement a way to call Windows' own file function appropriately (this second possibility would mean it's not a bug, just not implemented). (For example, Notepad++ has no problems creating, writing and renaming a file called Крым.xml.)
Just one example of the error messages I got, in this case when I used
mb_convert_encoding($theFilename , 'Windows-1252' , 'UTF-8')
"Warning: file_put_contents(dirToSaveIn/????.xml): failed to open stream: No error in C:\aa xampp\htdocs\myinterface.lo\myinterface\phpWriteLocalSearchResponseXML.php on line 26 "
With other conversions I got other error messages, ranging from 'invalid characters' to no string recognized at all.
Greetings
John

PHP starting with 7.1.0alpha2 supports UTF-8 filenames on Windows.
Thanks.
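A small sketch of what that looks like in practice, assuming PHP 7.1 or later on Windows (on Linux the same code should work with older versions too, since filenames there are plain byte strings). The temporary directory and the file content are arbitrary choices for the demo; the name is written with \u{...} escapes so the script does not depend on the encoding of its own source file.

```php
<?php
// "Крым.xml" spelled with Unicode escapes (PHP 7.0+), which always yield UTF-8 bytes.
$fileName = "\u{041A}\u{0440}\u{044B}\u{043C}.xml";
$path = sys_get_temp_dir() . DIRECTORY_SEPARATOR . $fileName;

// Write the file, then check that it can be found again under the same name.
$written = file_put_contents($path, '<root/>');
$exists  = file_exists($path);

// Clean up the demo file.
unlink($path);

var_dump($written, $exists);
```

In the scenario from the question, $fileName would come from $_POST['fileName'] instead, ideally after validating it, since a user-supplied name can contain path separators.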

After update to 5.4, fopen can't read file

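I have a website on a host that recently switched from PHP 5.2 to 5.4 and required us to choose a new php.ini file: 5.4 plain, 5.4 solo (just one php.ini file used throughout the site), or 5.4 fast.
I do not know which one I was using prior to the switch, but after I made it (I chose 5.4 solo), I noticed that a part of my website that depends on mbstring (multibyte characters) no longer works.
Specifically, it opens a text file that is full of characters, which is then used in an encryption script, and it stores garbage in the MySQL database. To retrieve the data, it is run through the script again, decrypted, and displayed on the screen.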
In specific, it opens a text file that is full of characters and then that is used in an encryption script and it stores garbage in the mysql database. Then to retrieve it, it's again run through the script and decrypted, and displayed on the screen.
This worked just fine until the 5.4 change. Now it appears that it's unable to retrieve (open?) the text file. I have tested this with a non-multibyte character version and that works fine, so I don't think the issue is with the code, but rather with the way PHP is treating multibyte chars...and I suspect, just a hunch, that this is fixable by tweaking the PHP.ini file somehow. Zend.multibyte seems to be PHP's new thing.
My problem is that I have no idea what to tweak. I tried several different Zend.multibyte/mbstring combos and that didn't work.
I know that everything works up until a string is sent for encryption. It comes back as a null value, instead of a garbled string. I feel like something in the string is being rejected by PHP and thus it's failing...offering nothing instead of the string it should.
Does anyone have a thought as to what might be happening and why my script no-longer works with 5.4? I have checked and the mbstring module IS loaded, with default values in the php.ini.
Any suggestions would be great...I'm totally stumped. Even some additional reports or ways to test or narrow down the problem would be fantastic.
Thank you!
Here is some code, where I think the problem is:
$this->s1 = "";
$s1array = array("a1.txt", "a2.txt", "a3.txt");
foreach ($s1array as $i => $value) {
    $myFile = "../a/dir/somewhere/$s1array[$i]";
    $fh = fopen($myFile, 'r');
    $theData = fgets($fh);
    fclose($fh);
    $this->s1 .= html_entity_decode($theData, ENT_NOQUOTES, 'UTF-8');
}
The files ../a/dir/somewhere/a1.txt and ../a/dir/somewhere/a2.txt (etc.) are semicolon-delimited strings of HTML-encoded letters, for example: & #x0fb0f;& #x02c97;& #x00436;& #x10833;& #x00514; (I added the spaces so it would show the code, not the HTML values!).
But I guess now, for some reason, this above code isn't returning any results. If I assign the result to a variable and echo that variable, there's nothing. But if I assign $this->s1 = "abcde"; or a longer string and skip the "foreach" part, it will work. So something in this process, this code, no longer works in 5.4. Can anyone tell what's going on here? Thank you!

Why use fopen and so on for text files when you could use file_put_contents and file_get_contents? They are mostly wrappers for fopen, fread and so on. I have never had any problems with UTF-8 using those two functions.
Also make sure everything (PHP, the database if you are using one, and the PHP files themselves) is encoded in or using UTF-8. There is nothing funnier than *.php files in, for example, Latin-2 and all the rest in UTF-8.
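A rough sketch of the file_get_contents() variant, with an explicit check so that an unreadable file fails loudly instead of silently yielding nothing. readDecoded() is a hypothetical helper name, and the temporary file only stands in for a1.txt. Note one difference from the original loop: fgets() reads a single line, whereas file_get_contents() reads the whole file.

```php
<?php
// Hypothetical helper: read a whole file and decode its HTML entities.
// Throws instead of returning an empty string when the file cannot be read.
function readDecoded(string $path): string
{
    $raw = @file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return html_entity_decode(trim($raw), ENT_NOQUOTES, 'UTF-8');
}

// Demo: a temporary file holding two numeric entities, standing in for a1.txt.
$tmp = tempnam(sys_get_temp_dir(), 'ent');
file_put_contents($tmp, "&#x00436;&#x00514;\n");
$decoded = readDecoded($tmp);
unlink($tmp);

// Print the decoded bytes as hex so the result is visible on any terminal.
echo bin2hex($decoded), "\n";
```

Dumping the bytes at each stage (raw file content, decoded string, encrypted string) is a cheap way to find the step where the data turns into null.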

Gettext() with larger texts

I'm using gettext() to translate some of my texts in my website. Mostly these are short texts/buttons like "Back", "Name",...
// I18N support information here
$language = "en_US";
putenv("LANG=$language");
setlocale(LC_ALL, $language);
// Set the text domain as 'messages'
$domain = 'messages';
bindtextdomain($domain, "/opt/www/abc/web/www/lcl");
textdomain($domain);
echo gettext("Back");
My question is, how 'long' can this text (id) be in the echo gettext("") part ?
Is it slowing down the process for long texts? Or does it work just fine too? Like this for example:
echo _("LZ adfadffs is a VVV contributor who writes a weekly column for Cv00m. The former Hechinger Institute Fellow has had his commentary recognized by the Online News Association, the National Association of Black Journalists and the National ");

The official gettext documentation merely has this advice:
Translatable strings should be limited to one paragraph; don't let a single message be longer than ten lines. The reason is that when the translatable string changes, the translator is faced with the task of updating the entire translated string. Maybe only a single word will have changed in the English string, but the translator doesn't see that (with the current translation tools), therefore she has to proofread the entire message.
There's no official limitation on the length of strings, and they can obviously exceed at least "one paragraph/10 lines".
There should be virtually no measurable performance penalty for long strings.

gettext effectively has a limit of 4096 chars on the length of strings.
When you exceed this limit you get a warning:
Warning: gettext(): msgid passed too long in %s on line %d
and gettext() returns bool(false) instead of the text.
Source:
PHP Interpreter repository - The real fix for the gettext overflow bug
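If that limit applies to your PHP build, a thin wrapper can keep over-long strings from turning into false. This is only a sketch: translateOrFallback() and MSGID_LIMIT are invented names, the 4096 figure is taken from the answer above rather than verified here, and the wrapper also falls back when the gettext extension is not loaded.

```php
<?php
// Limit quoted in the answer above; treated as an assumption, not a documented constant.
const MSGID_LIMIT = 4096;

// Hypothetical wrapper: translate when possible, otherwise return the original text.
function translateOrFallback(string $msgid): string
{
    // Skip gettext() for over-long ids or when the extension is missing.
    if (strlen($msgid) > MSGID_LIMIT || !function_exists('gettext')) {
        return $msgid;
    }
    $translated = gettext($msgid);
    // Guard against a non-string result (such as the false described above).
    return is_string($translated) ? $translated : $msgid;
}

echo translateOrFallback('Back'), "\n";
echo strlen(translateOrFallback(str_repeat('x', 5000))), "\n";
```

Splitting long texts into paragraph-sized messages, as the gettext documentation advises, avoids the problem altogether.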

See the gettext function reference: http://www.php.net/manual/en/function.gettext.php
Its argument is defined as a string, so your machine's memory would be the limiting factor.
Try benchmarking it with microtime, or better, with Xdebug if you have it on your development machine.
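A quick microtime() benchmark along those lines might look like the sketch below. secondsPerCall() is a hypothetical helper, the sample strings and iteration count are arbitrary, and an identity function is substituted when the gettext extension is not loaded, so treat the numbers as a rough comparison only.

```php
<?php
// Hypothetical helper: average wall-clock seconds per call, measured with microtime().
function secondsPerCall(callable $fn, int $iterations = 10000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) / $iterations;
}

// Fall back to an identity function when the gettext extension is not loaded.
$translate = function_exists('gettext') ? 'gettext' : fn(string $s): string => $s;

// A short id and a long one (well below 4096 bytes) to compare.
$short = 'Back';
$long  = str_repeat('A fairly long sentence to translate. ', 50);

printf("short: %.9f s/call\n", secondsPerCall(fn() => $translate($short)));
printf("long:  %.9f s/call\n", secondsPerCall(fn() => $translate($long)));
```

Without a bound text domain and a compiled .mo catalog the lookup path is not representative, so a meaningful run needs the same bindtextdomain() and textdomain() setup as the site itself.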

GBP £ symbol in ASCII php file being converted to Â£ on live server (transferring with git)

I have a piece of PHP code, which was written in notepad++ on a Windows 7 machine
The Encoding in notepad++ is set to "Encode to ANSI" (ASCII)
I am then doing this in my code:
utf8_encode("£")
so I am sure to get the utf friendly version of the £ symbol.
All works perfectly fine on the local server.
But when I push it up to my live server I'm getting all sorts of issues with utf8 encoding errors in php.
Is something in the git push/pull process corrupting this, or is it perhaps a locale setting on the live server?
Both local and live servers run ubuntu 12.04
Thanks
Update 1
The actual error I'm getting is
invalid byte sequence for encoding "UTF8": 0xa3'
(This is a Postgres SQL error)
Other difference in local and live is live is over https and local is just http (both apache)
Update 2
Running:
file -bi script.php
on both local and live produces:
text/x-php; charset=iso-8859-1
So it seems as if the encoding of the file is intact?
Update 3
Looking at the local Postgres installation it has the following settings:
ENCODING = 'UTF8'
LC_COLLATE = 'en_GB.UTF-8'
LC_CTYPE = 'en_GB.UTF-8'
Whereas live has:
ENCODING = 'UTF8'
LC_COLLATE = 'en_US.UTF-8'
LC_CTYPE = 'en_US.UTF-8'
I'm going to see if I can swap the collate types to match local and see if that helps
Update 4
I'm doing this, which ultimately results in the failing piece of code on live (not local):
setlocale(LC_MONETARY, 'en_GB');
$equivFinal = utf8_encode("£") . money_format('%.2n', $equivFinal);
Update 5
I'm getting closer to the issue.
On local the string is produced as
Â£1.00
On live the string is produced as
Â£ï¿½1.00
So for some reason the live server is adding more crap in when doing the UTF8 conversion
Update 6
Ok so I've pinned it down to this:
setlocale(LC_MONETARY, 'en_GB');
Logger::getInstance(__NAMESPACE__)->info("TEST 01= " .money_format('%.2n', 1.00));
On local it outputs
TEST 01= 1.00
As expected
on live it output
TEST 01= ï¿½1.00
With the random characters added to the start, which is what is causing my utf8 issue as it's croaking on that.
Any idea why money_format would do that on one server and not another?

Finally nailed it:
it's money_format.
If you don't specify a locale, or specify it incorrectly, then it just does its own thing.
So I was doing
setlocale(LC_MONETARY, 'en_GB');
and on local that meant money_format just left the £ off the start of the output,
but on live it meant that money_format put the unicode WTF character there.
Doing it properly for Ubuntu, with
setlocale(LC_MONETARY, 'en_GB.UTF-8');
means money_format comes out with £ at the front, and therefore I don't need my utf8 rubbish.
Update 1
Better still, don't bother with setlocale and I'm just going to do this:
utf8_encode("£") . money_format('%!.2n', $equivFinal);
Which basically formats the money and excludes the symbol prefix
and then better still just use number_format and do
utf8_encode("£") . number_format($equivFinal, 2);
I've learnt something new :)

The issue is that you can't save a raw GBP symbol inside an ASCII file: ASCII only covers the byte values 0 to 127, and £ is not among them.

Never use weird characters in your source code because, no matter how much they "should" work, you always run into problems like this. (You can come up with your own definition of "weird", but mine is anything you can't type on a US English keyboard without resorting to alt codes.)
To get around this restriction, concatenate the result of the chr() function. (Use the following code snippet to find the parameter you need to pass to chr(); it is 163 in this case, provided the file containing the snippet is saved in a single-byte encoding such as ISO-8859-1.)
<?php echo(ord('£')); ?>
So in your case the line would read:
$equivFinal = chr(163) . money_format('%.2n', $equivFinal);
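One caveat worth checking before relying on chr(163): it produces the single byte 0xA3, which is the same byte the Postgres error in the question complains about. The sketch below compares that byte with the UTF-8 form of the pound sign, using only core functions. It is an illustration rather than the asker's final code, and it uses number_format() because money_format() was removed in PHP 8.0.

```php
<?php
$latin1Pound = chr(163);    // one byte, 0xA3: the ISO-8859-1 / Windows-1252 form
$utf8Pound   = "\u{00A3}";  // two bytes, 0xC2 0xA3: the UTF-8 form (PHP 7.0+ escape)

// preg_match() with the /u flag returns false when the subject is not valid UTF-8.
$latin1IsUtf8 = preg_match('//u', $latin1Pound) === 1;
$utf8IsUtf8   = preg_match('//u', $utf8Pound) === 1;

// Build the price from pure ASCII source code, independent of file encoding.
$price = $utf8Pound . number_format(1234.5, 2);

echo bin2hex($latin1Pound), ' ', bin2hex($utf8Pound), "\n";
var_dump($latin1IsUtf8, $utf8IsUtf8);
echo bin2hex($price), "\n";
```

For a database that expects UTF-8, the two-byte form is the one to send; the escape keeps the source file itself plain ASCII, which is what the advice above is after.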
