Replacing smart quote with regular quote causes entire string to be erased - php

I have some files that were originally RTF files. They were opened with Microsoft Word 2016 and saved as .txt files. No other changes were made to the files. They were transferred to a Linux system.
Running the command
file myfile.txt on the Linux system reports them as Non-ISO extended-ASCII text, with CRLF line terminators.
I am reading the files into PHP and processing them line by line. I am trying to replace any right smart-quotes with regular single quotes, but my entire string is being erased.
My code looks like this:
$text = "I can’t go for supper";
$text = preg_replace('/\x{2019}/u', "'", $text);
echo $text;
The apostrophe here is a right smart quote, which shows up in Vim as <92>. From researching on the web, I found that this is Unicode character U+2019.
However, when I try to display the new value of $text nothing is displayed.
What is wrong with my code and why is it wiping out the entire string of text?

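Upon further research, I have determined that character code <92> is specific to the Windows-1252 character encoding, where byte 0x92 is the right single quotation mark. That byte on its own is not valid UTF-8, and with the /u modifier preg_replace returns NULL for a subject that is not valid UTF-8, which is why nothing was displayed. I first needed to convert the text to UTF-8 before I was able to manipulate the string.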
The following code works correctly:
$text = "I can’t go for supper";
$text = iconv("Windows-1252", "UTF-8", $text);
$text = preg_replace('/\x{2019}/u', "'", $text);
echo $text;
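To see the failure and the fix side by side, here is a minimal sketch. It assumes the file really contains the single byte 0x92 (written here as "\x92" so the example does not depend on how the script itself is saved) and that the iconv extension is available. The variable names are only for illustration.

```php
<?php
// "\x92" is the Windows-1252 byte for the right single quotation mark.
$raw = "I can\x92t go for supper";

// With the /u modifier, preg_replace() returns null when the subject
// is not valid UTF-8. Echoing null prints nothing, hence the "erased" string.
$failed = preg_replace('/\x{2019}/u', "'", $raw);
var_dump($failed === null);                           // bool(true)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)

// Convert from Windows-1252 to UTF-8 first, then replace U+2019.
$utf8  = iconv('Windows-1252', 'UTF-8', $raw);
$fixed = preg_replace('/\x{2019}/u', "'", $utf8);
echo $fixed, "\n"; // I can't go for supper
```

Checking the return value for null (or calling preg_last_error()) is a cheap way to catch this kind of encoding mismatch early.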

Related

PHP .rtf encoding problem with polish characters

I have a problem replacing Polish characters in an RTF file through PHP.
I want to find tag words in the RTF file content and replace them with the relevant content.
Here is what I'm doing:
// Getting rtf file content
$content = file_get_contents('<link_to_file_here>');
// encoding to utf-8
$content = mb_convert_encoding($content, 'UTF-8');
// replacing tagword with relevant content
$content = str_replace('[company_address]', 'Częstochowa', $content);
// save rtf file with replaced content
file_put_contents('uploads/test.rtf', $content);
echo $content;
When I checked what happened to the RTF file content after this code ran, I noticed that Częstochowa had been replaced with Cz\u0119stochowa.
Then I opened the newly created RTF file in MS Word and saw this: Częstochowa.
After this I decided to type Częstochowa manually in the RTF file and check what happens. I read the file content the same way (via file_get_contents) and noticed that MS Word had replaced my manually typed Częstochowa with Cz\\'eastochowa. So I decided to do this:
// replacing tagword with relevant content
$content = str_replace('[company_address]', 'Cz\\\'eastochowa', $content);
After this I opened the file in MS Word and saw this: Czêstochowa
I googled a bit and found that ê is a character from the Unicode block “Latin-1 Supplement” (U+0080 to U+00FF) with code U+00EA, but Polish characters are in the Unicode block “Latin Extended-A” (U+0100 to U+017F), so I need to encode the RTF file content for that block somehow.
I tried a lot of things but still haven't solved the problem.
I hope you can help. Thanks for your attention.
Found a solution:
$string = str_replace('&#', "\\u", mb_convert_encoding('Częstochowa', 'html'));
$content = str_replace('[company_address]', $string, $content);
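The trick above works because RTF writes a Unicode character as \uN (N being the decimal code unit) followed by one fallback character, and the trailing ";" of the HTML entity ends up playing that fallback role. Since the HTML-ENTITIES conversion in mbstring is deprecated as of PHP 8.2, here is a rough alternative sketch that builds the same kind of escape directly. The function name rtfEscapeUnicode is hypothetical, the input is assumed to be UTF-8, and it does not escape backslashes or braces, which RTF also treats specially.

```php
<?php
// Hypothetical helper: turn every non-ASCII character into an RTF \uN? escape.
function rtfEscapeUnicode(string $text): string
{
    return preg_replace_callback('/[^\x00-\x7F]/u', function (array $m): string {
        $out = '';
        // Split the character into UTF-16 code units (two for non-BMP characters).
        $units = unpack('n*', mb_convert_encoding($m[0], 'UTF-16BE', 'UTF-8'));
        foreach ($units as $unit) {
            // RTF expects a signed 16-bit decimal value.
            if ($unit > 32767) {
                $unit -= 65536;
            }
            // "?" is the fallback shown by readers that ignore \u.
            $out .= '\\u' . $unit . '?';
        }
        return $out;
    }, $text);
}

$escaped = rtfEscapeUnicode('Częstochowa');
echo $escaped, "\n"; // Cz\u281?stochowa
```

The result can then be passed to str_replace in place of $string in the snippet above.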

PHP trim non standard characters from the string

I am trying to save an XML file with a string pulled out of a text file (which is actually a PDF converted to TXT). In CMD (php.exe) the echo command shows the string normally, without any extra characters, but in the XML file I get a different result.
This is the string that I am trying to save.
Ponedjeljak
In CMD it shows it like this
Ponedjeljak\n
While in XML the string is stored with some extra characters, like this
Ponedjeljak
I have tried using preg_replace like this
preg_replace("/&#\\d+;|\n/", "", $dan);
But the string and the extra line are still saved in the XML. What am I doing wrong here and why is it saving the extra characters in the XML file? Both PHP and XML files are in UTF-8 encoding.
Try this:
$string = str_replace(array("\n", "\r"), '', $string);
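If it is not obvious which invisible characters are attached to the string, dumping it as hex before and after the cleanup can help. This is only a sketch: the value of $dan below is an assumption (a trailing CR LF, as is typical for text converted on Windows), not the actual data from the question.

```php
<?php
// Assumed example value: the visible word followed by a Windows line ending.
$dan = "Ponedjeljak\r\n";

// bin2hex() makes hidden bytes visible; 0d is CR and 0a is LF.
echo bin2hex($dan), "\n";

// Remove both carriage returns and line feeds.
$clean = str_replace(array("\n", "\r"), '', $dan);
echo bin2hex($clean), "\n";
```

If the hex dump shows other trailing bytes (for example a non-breaking space), those would need to be added to the list of characters to strip.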

Cross platform Word Document encoding issue

I have a PHP script that generates Word documents (.doc).
It takes characters from HTML entities, e.g. Π, and decodes them with PHP's html_entity_decode().
$line = html_entity_decode($text, ENT_NOQUOTES | ENT_HTML401, "UTF-8");
When opening the resulting file in LibreOffice on Linux, the file loads correctly (characters are correctly encoded). However, when opening it in Microsoft Word on Windows, the non-ASCII characters are incorrect. For example, the capital Greek letter pi (Π) is rendered as the Chinese character (螤).
I figure there is a missing header or metadata that tells Word that the data is encoded in UTF-8.
Hate to look the fool, but I've answered my own question.
Just after opening the file for writing, I added:
fwrite($fp, pack("CCC",0xef,0xbb,0xbf));
This writes the UTF-8 byte order mark to the start of the file.
http://en.wikipedia.org/wiki/Byte_order_mark
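For anyone who wants to verify what ends up in the file, here is a small self-contained sketch. It writes the BOM and one decoded entity to a temporary file and dumps the bytes. The temporary path and the &Pi; input are assumptions for illustration only.

```php
<?php
// Write to a throwaway file so the example has no side effects.
$path = tempnam(sys_get_temp_dir(), 'bom');
$fp = fopen($path, 'wb');

// Same three bytes as pack("CCC", 0xef, 0xbb, 0xbf): the UTF-8 byte order mark.
fwrite($fp, "\xEF\xBB\xBF");

// Decode an entity to UTF-8 and write it after the BOM.
fwrite($fp, html_entity_decode('&Pi;', ENT_NOQUOTES | ENT_HTML401, 'UTF-8'));
fclose($fp);

$bytes = file_get_contents($path);
echo bin2hex($bytes), "\n"; // efbbbfcea0
unlink($path);
```

The first three bytes (ef bb bf) are the BOM and the next two (ce a0) are the UTF-8 encoding of Π.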

PHP Removing Windows ^M Character

I have a CSV I am downloading from a source I'm not in control of and the end of each line is a
^M
character when printed to a bash terminal. How can I sanitize this input programmatically in PHP?
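What you're seeing is a carriage return (\r, byte 0x0D), which is part of the Windows CRLF line ending. To get rid of it in PHP, what you need to do is
$file = str_replace("\x0D", "", $file);
The escape works whether the hexadecimal digits are written in lowercase or uppercase ("\x0d" and "\x0D" are the same byte), so a case-insensitive replace is not needed.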
You can also ask PHP to auto-detect unusual line endings by adding this line before reading the CSV file. Note that this setting is deprecated as of PHP 8.1.
ini_set('auto_detect_line_endings', true);
^M is a carriage return, you should be able to remove it with:
$string = str_replace( "\r", "", $string);
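If you would rather keep the line structure than just delete the character, one option is to normalize every line ending to \n before splitting. A minimal sketch, with made-up CSV content standing in for the downloaded file:

```php
<?php
// Assumed sample data with Windows (CRLF) line endings.
$file = "id,name\r\n1,Alice\r\n2,Bob\r\n";

// Replace CRLF first, then any remaining lone CR, with a plain LF.
$normalized = str_replace(array("\r\n", "\r"), "\n", $file);

// Split into lines, dropping the trailing newline so no empty line is produced.
$lines = explode("\n", rtrim($normalized, "\n"));
print_r($lines);
```

Replacing "\r\n" before "\r" matters: doing it the other way round would turn each CRLF into two line feeds.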

weird characters in my generated PDF

I'm getting 􀀊􀀠􀀉􀀉 characters in my PDF. I've stripped out \r\n \r \n \t, trimmed everything, decoded HTML entities and stripped tags. Nothing helps. The data is coming from a MySQL database.
Any help would be appreciated.
Check the string's encoding (with mb_detect_encoding) before adding it to the PDF: is it a Unicode string? The data in the MySQL database can be stored as Unicode while your database connection uses a different encoding.
Did you try using utf8_decode()? (Note that it is deprecated as of PHP 8.2.)
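A quick way to run that check is sketched below. The sample value is an assumption (Latin-1 bytes standing in for what a mismatched connection might return), and the candidate list passed to mb_detect_encoding is only an example. Detection is a guess among the candidates you give it, not a guarantee.

```php
<?php
// Assumed sample: "Café" as ISO-8859-1 bytes, as a misconfigured connection might deliver it.
$fromDb = "Caf\xE9";

// mb_check_encoding() answers the narrower question: is this valid UTF-8?
$isUtf8 = mb_check_encoding($fromDb, 'UTF-8');
var_dump($isUtf8); // bool(false)

// Strict detection among a short list of candidates.
$detected = mb_detect_encoding($fromDb, array('UTF-8', 'ISO-8859-1'), true);
echo $detected, "\n"; // ISO-8859-1
```

If the check fails, fixing the connection character set is usually cleaner than converting every string afterwards.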
http://php.net/manual/en/function.utf8-decode.php
You might be using a font that is not available.
Try something like this to replace it (mb_ord() will give you its numeric value if you need it; ord() only returns the first byte of a multibyte character):
$str = 'Hello 􀀊 World';
echo str_replace('􀀊', '[removed]', $str);
Output:
Hello [removed] World
Have you tried
$string = "testContainingSpecialCharsäöüöüäüß";
$pdf->Cell(0,0,$string);
What characters should have been displayed instead of those 􀀊􀀠􀀉􀀉 things?
FPDF doesn't support unicode characters, so that might be the cause of your problem. There's an extension you could try at http://acko.net/node/56, or alternatively you could switch to another PDF generator library (I recommend TCPDF).
Or you could try using iconv to convert the text from UTF-8 to a supported character set (i.e. $str = iconv('UTF-8', 'windows-1252', $str);) if you want to stick with FPDF.
This looks like what happens when you copy and paste text from Microsoft Word. Does the PDF file contain text from an MS Word document by any chance? That might be your problem. There are some interesting comments on converting and stripping these characters in PHP on the PHP.net website: http://www.php.net/manual/en/function.strtr.php#39383
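As a rough illustration of that conversion, assuming the iconv extension is available and the input is valid UTF-8 (the sample string is taken from the test string in the answer above):

```php
<?php
// UTF-8 input: each of these characters takes two bytes.
$str = 'äöüß';

// Convert to Windows-1252, where each of them is a single byte.
$converted = iconv('UTF-8', 'windows-1252', $str);

echo strlen($str), "\n";          // 8
echo strlen($converted), "\n";    // 4
echo bin2hex($converted), "\n";   // e4f6fcdf
```

Keep in mind that iconv returns false (and raises a notice) when the input contains a character that has no Windows-1252 equivalent, so it is worth checking the return value before passing it on.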
I am only presuming it is MS Word characters in your PDF file.
