PHP: string breaks at special character

PHP: string breaks at special character - php

I wrote a small PHP script which does a "branding" on a present PDF file. This means on every page I put a string like "belongs to " at a special position. Therefor I use Zend_Pdf out of the Zend Framework.
Because the script is used in German language area, in one line there I use the special character "ö" ("Gehört zu ").
On my local machine (Windows, XAMPP) the script worked fine, but when moving it to my hoster's webspace (some Linux), the string breaks at "ö". That means in my PDF on appears "Geh".
The code is this:
if (substr($file, strlen($file) - 4) === '.pdf') {
$name = $user->GetName;
$fontSize = 12;
$xTextPos = 100;
$yTextPos = 10;
set_include_path(dirname(__FILE__)); // set include_path for external library Zend Framework
require_once('Zend' .DS . 'Pdf.php');
$pdf = Zend_Pdf::load($file);
$font = Zend_Pdf_Font::fontWithName(Zend_Pdf_Font::FONT_HELVETICA);
$branding = 'Gehört zu ' . $name; // German for: 'Belongs to ', problem with 'ö'
foreach ($pdf->pages as &$page) {
$page->setFont($font, $fontSize);
$page->drawText($branding, $xTextPos, $yTextPos);
}
}
I guess the problem is related to some kind of default charset or language setting of the PHP environment. So I searched here and tried out:
$branding = utf8_encode('Gehört zu ') . $name;
...and I made some experiments with functions like html_entity_decode but nothing helped and I decided stopping groping in the dark and open an own question.
Looking forward to any hints. Thank you in advance for your help!
EDIT: Meanwhile I found the same (?) problem, solved on a German forum. But if I do it like they say...
$branding = mb_convert_encoding('Gehört zu ', 'ISO-8859-1') . $name;
... the resulting branding in the PDF is "Gehrt zu ". The "ö" is skipped now.
For this I found another hint on the Zend issue tracker.
I sum up, that I can drop all UTF8 things and concentrate on Latin-1 AKA ISO 8859-1.
I still don't understand why the code worked on my Windows + XAMPP and now crashes on my hoster's Linux.

Your guess is right, the problem is related to encoding. Where exactly the encoding is messed up is hard to say from afar. I'm assuming you work not only with Zend_Pdf, but also have the MVC in place (meaning a complete Zend_Application).
You should check if your application serves pages as UTF-8, by setting:
resources.view.encoding = "UTF-8"
and also placing the appropriate meta-tags in your layout/view.
Depending on what Editor you use, your files may be encoded in a different encoding. You can use Notepad++ on Windows to check your file-encoding and for converting it to UTF-8 (don't just set the encoding to UTF-8, this might mess up your file!) if necessary. I recommend using Eclipse with text file encoding set to "UTF-8" (Preferences > General > Workspace) to make sure your code files are encoded in UTF-8.
Now comes the crucial part:
Zend_Pdf_Page::drawText(string $text, float $x, float $y, string $charEncoding)
See that last argument... set it. If you're lucky, you can skip the previous stuff and just set the encoding there.
edit: I missed something. Database connections. You should check the encoding there too. I frequently work with MS SQL Server, which uses Latin-1 internally; not setting driver_otpions.CharacterSet can mess up stuff pretty bad too. This might be relevant, if you have soemthing like Gehört zu: Günther, where the Name Günther is fetched from db.

Encoding is also depending of the file encoding.
If you encode your file in UTF8 for example and use ut8_encode("ö"), then you'll encode in UTF_8 something already in UTF_8.
So you may want to check what your file encoding is, and what your PDF lib is requiring. Then apply the right formula/transformation.

Related

Why won't this PHP script strip the BOM from a file?

I've tried everything I can think of (including notepad++, which would not work for me although it appears others have better luck).
I've some very old Access databases (I mean from the mid/late 1990s created using Visual Basic 4) which I'm converting to Web Apps (PHP, Javascript etc using a lot of AJAX).
I've an old copy of Access2000 which despite a myriad of errors does actually still work under Windows 10 with some persuasion; which I'm using to export data with.
What I have a slight problem with is encoding. I'm trying to export as UTF8 to try and avoid diamonds with ? marks in them. Access has a UFT8 export setting but it puts a BOM (FFFE) at the start of the file(s); which messes things up a bit.
Simple question: why does not the following work? Well, actually, it works but it strips the first 2 bytes after the BOM and leavs the BOM in place.
It's nothing I can't work round so no sweat; just wondering.
<?php
$h = fopen("test.txt", "rb");
$w = fopen("fixed.txt", "wb");
$r = fread($h, "2");
do{
$r = fread($h, "1");
fwrite($w, $r);
} while (!feof($h));
fclose($h);
?>

Special characters encoding in image filenames after server migration

I've migrated a WordPress website from a Hostgator shared host to a Ubuntu Digital Ocean LAMP stack.
The trouble started when I exported the image files which had special characters, for example the file
operários_tarsila-1024x640.jpg.
When WordPress tries to reach the file, it displays an error. I've found the cause:
I can see via Inspect Element that Wordpress tries to call: http://mywebsite.com/wp-content/uploads/2013/02/oper%C3%A1rios_tarsila-1024x640.jpg and the server returns a 404 error.
However if I type this URL in the browser: http://mywebsite.com/wp-content/uploads/2013/02/opera%CC%81rios_tarsila-1024x640.jpg it works and the image is displayed.
So, it seems like this difference between the á encoding from %C3%A1 (á character) to a+%CC%81 (combining accute accent) is what is causing WordPress to not display my images.
So now I have in my server thousands of accented image filenames with the structure character+ combining accent and WordPress calling the image filenames with the structure accented character.
Is there a way bash rename all of them with a comparisson table? Or a way to make Apache aware of those differences and point to the right file when this kind of confusion happen?

Apparently the problem is how the backup is decompressed on the new server.
There are 2 ways to fix this:
Rename the files manually by names without accents and then modify the database and change the file names in the database (This maluco and can be dangerous, it would be best to back up the database).
Upload files using Filezilla, but setting it to force the charset encoding in UTF-8.
File> Site Manager> {YOUR SITE}> Tab Charset> Force UTF-8

We have same problem - Mac + FileZilla + special characters in SK language.
Problem fixed using another FTP client (Cyberduck in our case ).
It seems to be a problem with FileZilla filenames encofing. Force utf8 encoding (FileZilla host settings) doesn't help.

So, just to touch upon this issue and a solution that worked for me... I also migrated a Wordpress site and found that all images with special characters in their filename produced a 404 after migration.
I ended up having to do the manual file renaming and edits to the database via phpMyAdmin. It was arduous and I definitely recommend backing up your database first.
In my case, I had a ton of media attachments that used the special character © in their filename.
First, I locally renamed the files by removing the character. I used 1-4a rename. Just found the filename and replaced it with nothing (not even a space). Then, I removed all the old files from the /wp-content/uploads/ folder and replaced them with the new files.
Next, I went into my database to update the table values. Media attachments have info stored in both the wp_posts and wp_postmeta tables. Below is the SQL I ran to update both -
update wp_posts set guid = replace(guid,'©','');
UPDATE wp_postmeta SET meta_value = REPLACE(meta_value, '©', '')
WHERE LOWER(RIGHT(meta_value, 5)) = '.jpeg' OR
LOWER(RIGHT(meta_value, 4)) IN ('.jpg', '.gif', '.png')
Which, again, we are replacing the character with nothing, not even a space.
I had to use the WP plugin Regenerate Thumbnails in order to have all of thumbnails + various attachment sizes update, but that did the trick.
I really appreciate everyone's efforts on this post and this post to help me figure it out! Hope this helps someone!

Have you tried setting the same encoding in PHP script, Mysql and HTML ?
PHP : http://php.net/manual/en/function.mb-internal-encoding.php
Mysql : http://php.net/manual/en/function.mysql-set-charset.php
HTML : <meta http-equiv="content-type" content="text/html; charset=utf-8" />
This problem is looking like a charset accordance problem between all these languages.
If this is not working, you will have to use a small script to rename all your pictures, using a function like :
function wd_remove_accents($str, $charset='utf-8')
{
$str = htmlentities($str, ENT_NOQUOTES, $charset);
$str = preg_replace('#&([A-za-z])(?:acute|cedil|caron|circ|grave|orn|ring|slash|th|tilde|uml);#', '\1', $str);
$str = preg_replace('#&([A-za-z]{2})(?:lig);#', '\1', $str); // pour les ligatures e.g. 'œ'
$str = preg_replace('#&[^;]+;#', '', $str); // supprime les autres caractères
return $str;
}
Source : http://www.weirdog.com/blog/php/supprimer-les-accents-des-caracteres-accentues.html

We have just had a similar problem with french caracters in our wordpress deployment, and our solution was to upload the files with FileZilla from a PC instead of FileZilla from a Mac.
When I would upload from mac OSX to the CentOS server, the files will only show if called in the a+%CC%81 format.
When I uploaded the files from the PC, apache found the files in the %C3%A1 format, which was how wordpress had them encoded.

If you have WP_CLI run this BashScript. You must change the wp_ table prefix.
It only modifies the file-names that are NOT on FORM_D format.
Backup your DB just in case something goes wrong.
#!/bin/bash
normalizeWP_PHP_Script=$'
global $wpdb;
$rows = $wpdb->get_results( "SELECT * FROM wp_postmeta where meta_key='"'"'_wp_attached_file'"'"'");
foreach ( $rows as $row )
{
$postId = $row->{'"'"'post_id'"'"'};
$filePath = $row->{'"'"'meta_value'"'"'};
if( ! normalizer_is_normalized($filePath, Normalizer::FORM_D) ){
$filename_nfd = Normalizer::normalize($filePath, Normalizer::FORM_D);
echo $filename_nfd." | ";
$wpdb->query($wpdb->prepare("UPDATE wp_postmeta SET meta_value='"'"'$filename_nfd'"'"' WHERE post_id=$postId"));
}
}';
wp eval "$normalizeWP_PHP_Script"
echo " - Uploads-url nomalized --nfd"

There's a plugin for this situation.
You can check on Media File Renamer

PHP's base64_decode returns wrong result working in IIS

I'm trying to show a jpg that was previously encoded in a WCF web service using:
<?php
require_once '../inc/config.php';
[...]
header("Content-type: image/jpg");
echo base64_decode($doc['BDATA']);
But I'm getting a
Can't display the image because it contains errors.
I've decoded the base64 string in this web app www.opinionatedgeek.com/dotnet/tools/base64decode/ and the result is right, but different that the one I'm getting with base64_decode, which is wrong.
Edit: I have two enviroments using the same code: Test and Production. It works fine in Test, but not in Production, so I'm thinking in some configuration problem.
I'm working with PHP 5.5.9 in Microsoft IIS.
An example of a string that base64_decode isn't decoding well:
/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYHBwcGBwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcIDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAARCAABAAEDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD9/KKKKAP/2Q==
Any ideas?
Edit 2: If I comment this line
require_once '../inc/config.php';
and copy the code from config.php to my actual file, it works fine. What could be happening?

From base_64_decode manual comments
php <= 5.0.5's base64_decode( $string ) will assume that a space is
meant to be a + sign where php >= 5.1.0's base64_decode( $string )
will no longer make that assumption
To fix this behavior try this code
$encodedData = str_replace(' ','+',$encodedData);
$decocedData = base64_decode($encodedData);
As this is no't your case then you have to check this answer
Because every thing work fine for me here on (WAMP)
EDIT:
As in our below conversation
There are a lot of things that may corrupt header for example , if
your file encoding is UTF-8 then you should save it as UTF-8 Without
bom you can do this using notepad ++ , also make sure if you use FTP
that your client didn't any chars to your file , rather than that
every thing should work fine

base64 encoding is not completely standardised.
Some implementations use different characters, so you'll have to replace those characters before you run your decode.
further details

After update to 5.4, fopen can't read file

I have a website on a host that recently switched from PHP 5.2 to 5.4, and required us to chose a new php.ini file: 5.4 plain, 5.4 solo (just one php.ini file used throughout the site), and 5.4 fast.
I do not know which one I was using prior to making the switch, but when I did, (I chose 5.4 solo), I noticed that a part of my website that depends on mbstring (multibyte characters) no longer works.
In specific, it opens a text file that is full of characters and then that is used in an encryption script and it stores garbage in the mysql database. Then to retrieve it, it's again run through the script and decrypted, and displayed on the screen.
This worked just fine until the 5.4 change. Now it appears that it's unable to retrieve (open?) the text file. I have tested this with a non-multibyte character version and that works fine, so I don't think the issue is with the code, but rather with the way PHP is treating multibyte chars...and I suspect, just a hunch, that this is fixable by tweaking the PHP.ini file somehow. Zend.multibyte seems to be PHP's new thing.
My problem is that I have no idea what to tweak. I tried several different Zend.multibyte/mbstring combos and that didn't work.
I know that everything works up until a string is sent for encryption. It comes back as a null value, instead of a garbled string. I feel like something in the string is being rejected by PHP and thus it's failing...offering nothing instead of the string it should.
Does anyone have a thought as to what might be happening and why my script no-longer works with 5.4? I have checked and the mbstring module IS loaded, with default values in the php.ini.
Any suggestions would be great...I'm totally stumped. Even some additional reports or ways to test or narrow down the problem would be fantastic.
Thank you!
Here is some code, where I think the problem is:
$this->s1 = "";
$s1array = array("a1.txt", "a2.txt", "a3.txt");
foreach ($s1array as $i => $value) {
$myFile = "../a/dir/somewhere/$s1array[$i]";
$fh = fopen($myFile, 'r');
$theData = fgets($fh);
fclose($fh);
$this->s1 .= html_entity_decode($theData, ENT_NOQUOTES, 'UTF-8');
}
The files ../a/dir/somewhere/a1.txt and ../a/dir/somewhere/a2.txt (etc) are semi-comma delimited strings of html coded letters, for example: & #x0fb0f;& #x02c97;& #x00436;& #x10833;& #x00514; (I added the spaces so it would show code not the HTML values!).
But I guess now, for some reason, this above code isn't returning any results. If I assign the result to a variable and echo that variable, there's nothing. But if I assign $this->s1 = "abcde"; or a longer string and skip the "foreach" part, it will work. So something in this process, this code, no longer works in 5.4. Can anyone tell what's going on here? Thank you!

Why you use fopen and so on for text files when you could use file_put_contents and file_get_contents - they are mostly wrappers for fopen, freads and so on. I have NEVER ever had any problems with UTF8 using that two functions.
Also make sure everything (from php, to db if you are using it, and php files) are encoded or using utf8. There is nothing funnier than *.php files in for example latin2 and all the rest in utf8.

How to format incoming email text for HTML display

I've set up a script that processes incoming emails and creates blog entries on Blogger. I'm using PEAR's Mail_Mime libs (for now) to read the incoming message. The messages often have characters in them that cannot be read by browsers--this happens most often when people use Outlook or cut/paste from MS Word.
So the output at the other end is something like this:
Here is a test post with “quotes” and ‘apostrophes�for what it�s worth, it also has dashes�and other strange formatting cut and paste from MS Word.
You can also see the output in the wild.
It's not hard to fix any specific instance, but each client (hotmail, gmail, outlook, etc) seems to handle things just a bit differently. Mail_Mime only seems to munge the output and, if I turn off Mail_Mime's parsing and try to translate the encoded characters myself using mb_convert_encoding or some manual simulation of this, it's even worse.
Please not that this is not going to be solved by selecting the right encoding type and using decode/encode/convert functions. The incoming formats vary from Windows-1252 to UTF8 to just about anything else mail clients can think of.
Has anyone scripted this before that could save me some time by offering up a sample or advice on the best approach? I've tried all the simple answers and done plenty of experimenting, so please don't bother responding unless you've dealt with a similar issue successfully or have a deep understanding of encoding issues.

The only way to do this is to do it by the spec's which is I'm afraid to pull in the 'Content-Type' mime header, pick up the charset (it'll look like Content-Type: text/plain; charset="us-ascii") then convert to UTF-8, and of course ensure your output on the web is sent as UTF-8 with the right headers.

To solve this problem, and get my message into valid UTF-8 that is readable from a browser, I found this PHP lib, ConvertCharset by Mikolaj Jedrzejak, which worked on almost everything. It still had issues with a specific symbol (=A0) when converting from Windows-1252 or iso-8859-1. So I converted this character manually before setting the code loose.
Here's what it looks like overall:
// decode using Mail_Mime
require 'Mail.php';
require 'Mail/mime.php';
require 'Mail/mimeDecode.php';
$params['include_bodies'] = true;
$params['decode_bodies'] = true; // this decodes it!
$params['decode_headers'] = true;
$decoder = new Mail_mimeDecode($input);
$mime = $decoder->decode($params);
// too much work to put in this example
$charset = ...; //do some magic with $mime->parts to get the character set
$text = ...; //do some magic with $mime->parts to get the text
// fix the =A0 control character; it's already been decoded
// by Mail_Mime, so we need the actual byte code now
// this has to be done before trying to convert to UTF-8
$char = chr(hexdec(substr('A0',1)));
$text = str_replace($char, '', $text);
// convert to UTF-8 using ConvertCharset
require 'ConvertCharset.class.php';
if( strtolower($charset) != 'utf-8' ) {
$converter = new ConvertCharset($charset, 'utf-8', false);
}
$text = $converter->Convert($text);
Then everything is spiffy. It even does the infamous Iñtërnâtiônàlizætiøn conversion, as well as accepting french, spanish, and pastes directly from MS Word :)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

PHP: string breaks at special character - php

Related

Why won't this PHP script strip the BOM from a file?

Special characters encoding in image filenames after server migration

PHP's base64_decode returns wrong result working in IIS

After update to 5.4, fopen can't read file

How to format incoming email text for HTML display

Categories

Resources