PHP: trim non-standard characters from a string

I am trying to save an XML file containing a string pulled from a text file (which is actually a PDF converted to TXT). In CMD (php.exe), echo shows the string normally, without any extra characters, but the XML file contains a different result.
This is the string that I am trying to save.
Ponedjeljak
In CMD it shows it like this
Ponedjeljak\n
While in XML the string is stored with some extra characters, like this
Ponedjeljak
I have tried using preg_replace like this
preg_replace("/&#\\d+;|\n/", "", $dan);
But the string and the extra line are still saved in the XML. What am I doing wrong here and why is it saving the extra characters in the XML file? Both PHP and XML files are in UTF-8 encoding.

Try this:
$string = str_replace(array("\n", "\r"), '', $string);
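A likely explanation, assuming the text file uses Windows line endings: the string ends in a carriage return as well as a newline. The console does not show the carriage return, but many XML writers serialize it as a numeric character reference. Note also that preg_replace and str_replace return the cleaned string instead of modifying the variable, so the result has to be assigned. A minimal sketch; cleanLineEndings() is a hypothetical helper name, and the "\r\n" ending is an assumption about the input:

```php
<?php
// Hypothetical helper: remove carriage returns and newlines, then trim both ends.
function cleanLineEndings(string $value): string
{
    // str_replace returns a new string; it does not change $value in place.
    return trim(str_replace(["\r", "\n"], '', $value));
}

$dan = "Ponedjeljak\r\n";      // assumed raw value read from the converted text file
$dan = cleanLineEndings($dan); // assign the result back before writing the XML

echo $dan, "\n";
```

This removes line breaks anywhere in the string, not only at the end; if inner line breaks must survive, rtrim($dan, "\r\n") would be the narrower choice.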

Related

PHP replace string with file_get_contents

I'm trying to replace a string in a file, so far so good.
The file is an .htm file in HTML format, but this shouldn't be a problem.
My script looks like this:
$file = './test.htm';
$content = file_get_contents($file);
$str = str_replace('Signature','Test',$content);
file_put_contents('./test2.htm', $str);
The problem is str_replace doesn't replace the string "Signature", the output file has exactly the same content as my input file.
If I put the file content directly into a string variable instead of using file_get_contents, my script works like a charm.
Your code looks fine.
Make sure the file actually contains 'Signature'.
Make sure there are no non-printable Unicode characters inside 'Signature'.
Append 'Signature' to the end of your test.htm and see if your code works.
Edited:
Make sure your file uses a valid and supported encoding (like UTF-8).
With help from @MB-abb I found out that the encoding is UCS-2 LE BOM.
The added line in my script is now:
$str = mb_convert_encoding($str, "UTF-8", "UCS-2LE");
Which changes UCS-2LE to UTF-8.
Now str_replace works like a charm.
Thanks!
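For the replacement to match, the conversion has to happen before str_replace runs, because in UCS-2LE every ASCII character is followed by a zero byte and the UTF-8 search string never appears as a contiguous byte sequence. A minimal sketch of that order of operations; replaceInUcs2Content() is a hypothetical helper name, and the file content is simulated in memory instead of being read from ./test.htm:

```php
<?php
// Hypothetical helper: convert UCS-2LE content to UTF-8 first, then replace.
function replaceInUcs2Content(string $raw, string $search, string $replace): string
{
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UCS-2LE');
    return str_replace($search, $replace, $utf8);
}

// Simulated file content; a real script would call file_get_contents('./test.htm').
$raw = mb_convert_encoding('<p>Signature</p>', 'UCS-2LE', 'UTF-8');

// On the raw bytes nothing matches, so the content comes back unchanged.
var_dump(str_replace('Signature', 'Test', $raw) === $raw); // bool(true)

echo replaceInUcs2Content($raw, 'Signature', 'Test'), "\n"; // <p>Test</p>
```

The result is UTF-8, so the output file written with file_put_contents would be UTF-8 as well.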

Replacing smart quote with regular quote causes entire string to be erased

I have some files that were originally RTF files. They were opened with Microsoft Word 2016 and saved as .txt files. No other changes were made to the files. They were transferred to a Linux system.
Running the command file myfile.txt on the Linux system reports them as Non-ISO extended-ASCII text, with CRLF line terminators.
I am reading the files into PHP and processing them line by line. I am trying to replace any right smart-quotes with regular single quotes, but my entire string is being erased.
My code looks like this:
$text = "I can’t go for supper";
$text = preg_replace('/\x{2019}/u', "'", $text);
echo $text;
The apostrophe here is a right smart quote, which shows up in Vim as <92>. After researching on the web, I discovered that this is actually Unicode character U+2019.
However, when I try to display the new value of $text nothing is displayed.
What is wrong with my code and why is it wiping out the entire string of text?
Upon further research, I have determined that character code <92> is specific to the Windows-1252 character encoding. I first needed to convert it to UTF-8 before I was able to manipulate the string.
The following code works correctly:
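// 0x92 is the right single quote in Windows-1252, the byte that Vim shows as <92>
$text = "I can\x92t go for supper";
// Convert from Windows-1252 (not Windows-1251) so the text is valid UTF-8
$text = iconv("Windows-1252", "UTF-8", $text);
$text = preg_replace('/\x{2019}/u', "'", $text);
echo $text;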
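The reason the whole string disappears: with the /u modifier the subject must be valid UTF-8, and a lone 0x92 byte is not. In that case preg_replace returns NULL rather than the original string, and echoing NULL prints nothing. A minimal sketch that shows the failure and then converts only when needed; it uses mb_convert_encoding from the mbstring extension where the answer above uses iconv, and Windows-1252 as the source encoding is taken from the answer:

```php
<?php
$text = "I can\x92t go for supper"; // 0x92: right single quote in Windows-1252

// The subject is not valid UTF-8, so the /u pattern fails and NULL comes back.
$result = preg_replace('/\x{2019}/u', "'", $text);
var_dump($result);                                   // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// Convert only when the input is not already valid UTF-8.
if (!mb_check_encoding($text, 'UTF-8')) {
    $text = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

$fixed = preg_replace('/\x{2019}/u', "'", $text);
echo $fixed, "\n"; // I can't go for supper
```

Checking preg_last_error() after a NULL result is a cheap way to tell an encoding problem apart from a pattern that simply did not match.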

Using PHP to write a CSV file with Spanish characters

I have a PHP script saved in UTF-8 encoding. In it I have an array with special characters (like an n with a ~ on top, ñ). It looks just fine in my editor. The script matches the array against text coming in from an HTML form and writes a CSV file. I write the file like this:
fwrite($fp,utf8_encode($data),strlen($data)+100);
When I open the file it says it is UTF-8 encoded, but the characters are all messed up.
Have you tried without using utf8_encode() on the data?
It seems that you are re-encoding something that is already UTF-8 encoded.
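A minimal sketch of what double encoding does, assuming the data really is UTF-8 already. utf8_encode() treats every input byte as ISO-8859-1; the sketch reproduces that with mb_convert_encoding so it also runs on PHP versions where utf8_encode() is deprecated. The sample data and the temporary file are illustrative only:

```php
<?php
$data = "España,año,niño\n"; // already UTF-8 because the script file is saved as UTF-8

// Same effect as utf8_encode(): read each byte as ISO-8859-1 and encode it again.
$double = mb_convert_encoding($data, 'UTF-8', 'ISO-8859-1');
echo $double; // EspaÃ±a,aÃ±o,niÃ±o

// Write the string unchanged; fwrite does not need a length argument.
$path = tempnam(sys_get_temp_dir(), 'csv'); // temporary file for the demonstration
$fp = fopen($path, 'w');
fwrite($fp, $data);
fclose($fp);

$written = file_get_contents($path);
echo $written; // España,año,niño
unlink($path);
```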

Encoding problems in odtphp segments

I'm writing text from database to ODT document table using odtphp, using this http://www.odtphp.com/index.php?i=tutorials&p=tutorial6 example. In generated ODT some international characters are encoded wrong (or not encoded?). There was similar problem with other values, not in segments, that were set using setVar() function, but it was solved using
$odf->setVars($k, $v, true, 'UTF-8');
Looks like there's no additional settings for segment values.
It looks like all text in segments was encoded to UTF-8 again, even when the text was already UTF-8.
For now I have solved this issue by replacing line 203 of Segment.php in odtphp with the following code:
return $this->setVars($meth, $args[0], false, 'UTF-8');

weird characters in my generated PDF

I'm getting 􀀊􀀠􀀉􀀉 characters in my PDF. I've stripped out \r\n, \r, \n and \t, trimmed everything, decoded HTML entities and stripped tags. Nothing helps. The data is coming from a MySQL database.
Any help would be appreciated.
Check the string's encoding (with mb_detect_encoding) before adding it to the PDF: is it a Unicode string? The data in the MySQL database can be stored as Unicode while your database connection uses a different encoding.
Did you try using utf8_decode()?
http://php.net/manual/en/function.utf8-decode.php
You might be using a font that is not available.
Try something like this to replace it. Pass the whole character to str_replace: chr(ord(...)) would only yield the first byte of a multi-byte character.
$str = 'Hello 􀀊 World';
echo str_replace('􀀊', '[removed]', $str);
Output:
Hello [removed] World
Have you tried
$string = "testContainingSpecialCharsäöüöüäüß";
$pdf->Cell(0,0,$string);
What characters should have been displayed instead of those 􀀊􀀠􀀉􀀉 things?
FPDF doesn't support unicode characters, so that might be the cause of your problem. There's an extension you could try at http://acko.net/node/56, or alternatively you could switch to another PDF generator library (I recommend TCPDF).
Or you could try using iconv to convert the text from UTF-8 to a supported character set (e.g. $str = iconv('UTF-8', 'windows-1252', $str);) if you want to stick with FPDF.
This looks like what happens when you copy and paste text from Microsoft Word. Does the PDF file contain text from an MS Word document by any chance? That might be your problem. There are some interesting comments on converting and stripping these characters in PHP on the PHP.net website: http://www.php.net/manual/en/function.strtr.php#39383
I am only presuming it is MS Word characters in your PDF file.
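A minimal sketch of the conversion suggested above, assuming the database returns UTF-8 and the PDF font expects Windows-1252. toFpdfText() is a hypothetical helper name, and mb_convert_encoding is used here in place of iconv; characters with no Windows-1252 equivalent are replaced by mbstring's substitute character:

```php
<?php
// Hypothetical helper: convert UTF-8 text to Windows-1252 before handing it to the PDF library.
function toFpdfText(string $utf8): string
{
    return mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
}

$string = "testContainingSpecialCharsäöüß";
$converted = toFpdfText($string);

// Each of the four non-ASCII characters is now a single byte instead of two.
var_dump(strlen($string) - strlen($converted)); // int(4)

// With FPDF loaded, the call from the answer above would become:
// $pdf->Cell(0, 0, toFpdfText($string));
```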
