Handle Arabic strings in PHP with Eclipse

I am currently working on the localization of a website that was originally in English only. A third-party company did the translations and provided us with an Excel file, which I successfully converted to a PHP array that I can use in my views. I'm using Eclipse for Windows to edit my PHP files.
All is fine, except that I need to add variables in my strings, e.g.:
'%1 is now following %2'
In Arabic I was provided with strings like this one:
'_______الآن يتتبع _______'
I find that replacing each ___ with %1 and %2 is incredibly difficult because the Arabic part is a right-to-left string, and the %1 and %2 will be treated as either left-to-right or right-to-left, and I can't tell which. I rarely get the parameter order I expect, because %1 will sometimes go to the left of the string and sometimes to the right, depending on where I start to type. Copy-pasting the replacement strings can have the same strange effects.
Most of the time I end up with a string like this one:
%2الآن يتتبع %1
The %1 should be on the right-hand side and the %2 on the left-hand side. The %1 is obviously treated as a right-to-left string because the % appears on the right. The %2 is treated as left-to-right.
I'm sure someone has had this issue before. Is there any way it can be done easily in Eclipse? Or using a smarter editor for Arabic issues? Or maybe it is a Windows issue? Is there a workaround?
UPDATE
I also tried splitting my string into multiple strings, but this also changes the order of the parameters:
'%1' . 'الآن تتبع' . '%2'
UPDATE 2
It seems that changing the replacement string makes things better. It is probably linked to how numbers are handled in Arabic strings. This string was edited in Eclipse without any problem. The order of the parameters is correct, and the string is handled correctly by PHP:
'{var2} الآن يتتبع {var1}'
If no other solution is found, this could be a good alternative.

Being an Arabic speaker, I get lots of localization tasks. Although I haven't faced this problem in particular, I've had many left-to-right/right-to-left issues while editing. I've had success working with Notepad++.
So here's what I usually do when I want to edit Arabic text:
Open empty Notepad++ *
Set encoding to UTF-8 (Encoding -> Encoding in UTF-8)
Enable RTL mode (View -> Text Direction RTL)
Paste your strings
And here's a screenshot showing how I'm editing your string
*: for some reason, whenever I open an already existing file things go bananas. So maybe I'm being superstitious, but this has always worked for me.
Update: First time I did this I was skeptical because the strings looked wrong, but then I did this:
print_r(str_split($string));
and I saw that they're indeed in the correct order.
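Since the editor's visual order can be misleading, the same check can be done on the placeholders directly. This is only a minimal sketch: `placeholder_order()` is a hypothetical helper, not part of PHP, and the sample string writes the Arabic word as Unicode escapes so that the logical order in the source is unambiguous.

```php
<?php
// Hypothetical helper: returns the placeholders in the order in which they
// are stored in the string (logical order), whatever the editor displays.
function placeholder_order(string $s, array $placeholders): array
{
    $positions = [];
    foreach ($placeholders as $p) {
        $pos = strpos($s, $p);          // byte offset of the placeholder
        if ($pos !== false) {
            $positions[$p] = $pos;
        }
    }
    asort($positions);                   // sort by offset, keep keys
    return array_keys($positions);
}

// Sample string: {var1}, then an Arabic word given as escapes, then {var2}.
$string = "{var1} \u{0627}\u{0644}\u{0622}\u{0646} {var2}";

print_r(placeholder_order($string, ['{var2}', '{var1}']));
```

If the printed order matches the order intended by the translation, the string is stored correctly even when the editor draws it differently.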

@Adnan helped me realize and later confirmed that there are issues when mixing Latin numbers with Arabic text.
Based on that conclusion, the solution is simply to stop using %1, %2, %3, ... as placeholders. I will be using more descriptive keywords instead, for example {USER}, {ALBUM}, {PHOTO}, ...
This shows the expected result in the PHP file and it is easily editable:
'ar' => '{USER} الآن يتابع {ALBUM}'
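One possible way to fill such named placeholders is `strtr()`, which replaces every key of an array in a single pass. The sketch below is an assumption about how the substitution could be wired up: the `translate()` helper, the `$translations` array layout, the English string and the sample values are all made up for illustration.

```php
<?php
// Hypothetical translation table using named placeholders.
$translations = [
    'en' => '{USER} is now following {ALBUM}',
    'ar' => '{USER} الآن يتابع {ALBUM}',
];

// Hypothetical helper: wraps each variable name in braces and lets
// strtr() substitute all placeholders at once.
function translate(string $template, array $vars): string
{
    $map = [];
    foreach ($vars as $name => $value) {
        $map['{' . $name . '}'] = $value;
    }
    return strtr($template, $map);
}

echo translate($translations['en'], ['USER' => 'Alice', 'ALBUM' => 'Holidays']), PHP_EOL;
echo translate($translations['ar'], ['USER' => 'Alice', 'ALBUM' => 'Holidays']), PHP_EOL;
```

Because the placeholders are matched by name, the translator can move them anywhere in the sentence without changing the calling code.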

I would prefer the original Notepad for this kind of task.
Open Notepad, make sure you're in LTR mode
Type %1
Change mode to RTL by pressing CTRL + SHIFT
Paste the Arabic string into the editor.
Revert back to LTR by pressing CTRL + SHIFT again.
Type %2
Select all with CTRL + A and copy with CTRL + C
Paste into the IDE. It should look weird but execute as expected.
Reason for using Notepad: more complex editors such as Notepad++, Sublime, Coda (Mac), and some IDEs (in your case Eclipse) may not use the correct encoding, while Notepad is simple yet works well for multilingual tasks.

Related

PHP generates spaces before the XML starts [duplicate]

I have a bizarre problem: Somewhere in my HTML/PHP code there's a hidden, invisible character that I can't seem to get rid of. By copying it from Firebug and converting it I identified it as &#65279; (U+FEFF), or 'Zero width no-break space'. It shows up as a non-empty text node in my website and is causing a serious layout problem.
The problem is, I can't get rid of it. I can't see it in my files even when turning Invisibles on (duh). I can't seem to find it, no search tool seems to pick up on it. I rewrote my code around where it could be, but it seems to be somewhere deeper in one of the framework files.
How can I find characters by charcode across files or something like that? I'm open to different tools, but they have to work on Mac OS X.
You don't see the character in your editor because text editors that understand it hide it. U+FEFF is the so-called byte-order mark (BOM). In UTF-16 it is stored as the bytes FE FF or FF FE, which tells a reader of a Unicode file in which order multi-byte units are stored. Some tools, notably on Windows, also write it at the start of UTF-8 files.
To get rid of it, tell your editor to save the file either as ANSI/ISO-8859 or as Unicode without BOM. If your editor can't do so, you'll either have to switch editors (sadly) or use a tool such as a hex editor that lets you see how the file really looks and delete the bytes.
From a quick search, it seems that TextWrangler has a "UTF-8, no BOM" mode. Otherwise, if you're comfortable with the terminal, you can use Vim:
:set nobomb
and save the file. Presto!
The characters are always the very first in a text file. Editors with support for the BOM will not, as I mentioned, show it to you at all.
If you are using Textmate and the problem is in a UTF-8 file:
Open the file
File > Re-open with encoding > ISO-8859-1 (Latin1)
You should be able to see and remove the first character in file
File > Save
File > Re-open with encoding > UTF8
File > Save
It works for me every time.
It's a byte-order mark. Under Mac OS X: open a terminal window, go to your sources and type:
grep -rn $'\xEF\xBB\xBF' .
It will show you the filenames and line numbers containing a UTF-8 BOM (the bytes EF BB BF).
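The same search can also be sketched in PHP, which helps when the BOM hides in a framework file. This is a minimal sketch under assumptions: it only looks for the UTF-8 BOM (bytes EF BB BF) at the very start of each file, and the function names `has_utf8_bom()`, `strip_utf8_bom()` and `find_files_with_bom()` are hypothetical.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF";

// True when the byte string starts with the UTF-8 byte-order mark.
function has_utf8_bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

// Returns the byte string without a leading UTF-8 BOM, if there is one.
function strip_utf8_bom(string $bytes): string
{
    return has_utf8_bom($bytes) ? substr($bytes, 3) : $bytes;
}

// Walks a directory tree and lists the files that start with a UTF-8 BOM.
function find_files_with_bom(string $dir): array
{
    $found = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if (!$file->isFile()) {
            continue;
        }
        // Read only the first three bytes of each file.
        $head = file_get_contents($file->getPathname(), false, null, 0, 3);
        if ($head !== false && has_utf8_bom($head)) {
            $found[] = $file->getPathname();
        }
    }
    return $found;
}
```

A call such as `print_r(find_files_with_bom(__DIR__));` would list the affected files; each one can then be rewritten with `strip_utf8_bom()` or re-saved without BOM in an editor.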
In Notepad++, there is an option to show all characters. From the top menu:
View -> Show Symbol -> Show All Characters
I'm not a Mac user, but my general advice would be: when all else fails, use a hex editor. Very useful in such cases.
See "Comparison of hex editors" on Wikipedia.
I know it is a little late to answer this question, but I am adding how to change the encoding in Visual Studio; I hope it will be helpful for someone reading this later:
Go to File -> Save (your filename) as...
And in File Explorer window, select small arrow next to the Save button -> click Save with Encoding...
Click Yes (on Do you want to replace existing file dialog)
And finally select e.g. Unicode (UTF-8 without signature) - that removes BOM

Encoding Dilemma! %C3%A9 vs e%CC%81

If you look at the two links below, they look the same, but one of them works and the other does not. After analysing the problem, it seems there is a difference in how the "é" character (and every other accented character) is encoded: it is either treated as one character, or as the unaccented letter followed by a separate accent character.
This problem is causing the images on the website to be broken, although the files are there on the FTP server.
The question is how to fix that: is the fix in WordPress, the database, or the server?
Thanks, and sorry for my poor English.
http://r20med.regions20.org/wp-content/uploads/2016/07/Portes-Ouvertes-sur-le-tri-sélectif-à-Hai-Essabah-Oran_017.jpg
http://r20med.regions20.org/wp-content/uploads/2016/07/Portes-Ouvertes-sur-le-tri-sélectif-à-Hai-Essabah-Oran_017.jpg
If you look at the screen capture below (taken from a text editor after copy-pasting your links above), you can see the difference.
The easiest solution I found was to change the filenames in an editor that only uses the accentuation you can see in the first row.
Conclusion: never trust uploads! Always check them for everything that's around! :) (This is true for filenames, texts, HTML, etc. - even if they are not intended to be harmful, they may block functions of your website / app and cause other problems!)
Note:
not all text editors show them to be different, so choose one that does!
if possible, get rid of accents in filenames (if they are uploaded, use a function to sanitise them, or even turn them into a slug as WordPress does with sanitize_title() or sanitize_title_with_dashes())
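The two encodings from the title can be reproduced in a few lines of PHP. The sketch below assumes PHP 7 or later for the `\u{...}` escapes; the `to_nfc()` helper is hypothetical and relies on the `Normalizer` class from the optional intl extension, falling back to the unchanged name when intl is not installed.

```php
<?php
// The same visible letter, stored in two different ways.
$composed   = "\u{00E9}";   // single code point: e with acute accent
$decomposed = "e\u{0301}";  // plain "e" followed by a combining acute accent

echo rawurlencode($composed), PHP_EOL;    // %C3%A9
echo rawurlencode($decomposed), PHP_EOL;  // e%CC%81

// Hypothetical helper: bring a filename to the composed form (NFC) so that
// links and files on disk use the same bytes.
function to_nfc(string $name): string
{
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($name, Normalizer::FORM_C);
        return $normalized === false ? $name : $normalized;
    }
    return $name; // intl extension missing: leave the name untouched
}
```

Normalizing uploaded filenames to one form before saving them is one way to keep the stored name and the generated URL in sync.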

php remove unknown characters

I am building a web application which will run in electron with angular as a frontend framework and laravel as a backend framework. In the application it's possible to login with a smartcard (thanks to node-pcsclite), it reads the bytes on the smartcard and then I convert them.
The smartcard contains a code which is linked to the staff table in my MSSQL database. I can retrieve the code from the smartcard and I can log into the application when it uses mysql as database server.
Now when I'm trying to do the same but with mssql, I get an error which should be viewed in html mode instead of the error page itself.
(The code can be alphanumeric)
So it adds all these strange characters (probably non-existing characters), not that much of a problem right? At least, that's what I thought. So I tried to fix it by using this code inside my laravel controller:
preg_replace('/[^A-Za-z0-9\-]/', '', $string);
This didn't solve anything. Then I thought I might have a problem with the query, so I ran SQL Profiler, the problem is that (probably because of the special characters) the query is broken.
select top 1 * from [Staff] where [CodeInit] = '
go
So does anyone know how to really remove the strange characters?
If you need more information feel free to ask.
I had this problem and landed on this question when searching for a solution. I was unable to find any fix.
My string with non-printable characters was returned by mdecrypt_generic(), so I wanted a way to remove those characters. When I copy and paste the retrieved value from the browser into the Brackets text editor, it shows them as red dots.
I pasted one into Google and it was encoded as %10. Nothing had helped so far, so as a temporary solution I just used rtrim() to remove those dots.
Copy the dot in Brackets and put it in place of "DOT_HERE":
rtrim(rtrim($pvp, "DOT_HERE"), "\0\4");
"\0\4" will remove only nulls and EOT but not that dot character(%10).
Further here is a screenshot with that red dot. You can use Brackets text editor to see this.
Note that $pvp is the decrypted text.
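Instead of pasting the invisible character into the source, the byte can be written as an escape: %10 corresponds to "\x10". The sketch below is an assumption about what the decrypted value might contain. `strip_control_chars()` is a hypothetical helper that removes every ASCII control character, including tabs and newlines, so it only fits values where those are not expected.

```php
<?php
// Hypothetical helper: removes all ASCII control characters (0x00-0x1F, 0x7F).
function strip_control_chars(string $s): string
{
    return preg_replace('/[\x00-\x1F\x7F]/', '', $s) ?? $s;
}

// Made-up sample: text followed by padding bytes, as decryption might return.
$pvp = "secret\x10\x10\x00\x04";

// bin2hex() makes the invisible bytes visible for inspection.
echo bin2hex($pvp), PHP_EOL;

// Option 1: trim only trailing control bytes, using rtrim()'s ".." range syntax.
echo rtrim($pvp, "\x00..\x1F"), PHP_EOL;

// Option 2: remove control bytes anywhere in the string.
echo strip_control_chars($pvp), PHP_EOL;
```

Inspecting the value with `bin2hex()` first shows which bytes are really present before deciding what to strip.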

How to remove question mark garbage data, dynamically, from files?

I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.
It'll usually look something like this in an HTML file in a browser:
this is the beginning of the file, ��
In the file, it'll appear like this:
this is the beginning of the file, \xE2\xA0
I tried using a regex editor to remove it, but it was to no avail, it cannot find it at all. How can I remove this garbage data? Again, some of the files have all kinds of HTML markup.
Thank you for any help.
Those appear because something is wrong with a character set on your site.
For example, your files are stored in Unicode, but your Content-Type is set as text/html; charset=ISO-8859-1. The problem could also be how text is stored in your database, or with how text is represented internally by your programming language.
Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.
You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.
I recommend using command line tools like file to examine what character set a text file is stored in and iconv to convert text files from one character set to another.
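When the files are processed from PHP rather than from the command line, a similar check and conversion can be sketched with the mbstring extension. The assumptions are stated loudly: mbstring must be installed, the `to_utf8()` helper is hypothetical, and the source encoding ISO-8859-1 is only a guess that must be replaced by the real encoding of the files.

```php
<?php
// Hypothetical helper: returns the text as UTF-8. Bytes that are already
// valid UTF-8 are kept as-is; anything else is converted from the encoding
// the caller assumes (ISO-8859-1 by default, which is only a guess).
function to_utf8(string $bytes, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $assumedSource);
}

// "caf" followed by the Latin-1 byte for e with acute accent.
$latin1 = "caf\xE9";
echo bin2hex(to_utf8($latin1)), PHP_EOL;
```

Converting with the wrong assumed source silently produces different garbage, so it is worth confirming the real encoding with a tool like file before converting in bulk.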
There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.
As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.
Is there any chance someone copied & pasted stuff from a Russian version of some software package?
Also, you can get & use iconv on Windows.

<expandproperties /> and encoding with phing

I have a static application (html/js) that I must put online in two different languages. While it's not hard to copy the whole directory twice and replace each string, it's kind of boring and I expect to see regular changes on this project.
Thus I considered building the two versions of the project (fr and nl) with phing. The idea would be to use a <filterchain><expandproperties />, and load the translations from a properties file.
It works quite well, except for one thing:
Unicode characters are represented as \uXXX, which, obviously, is not what I want...
Any idea on how I can fix this ? An excerpt of the build.xml can be found here: http://pastebin.com/2uWHaHvi
SOLUTION: It turns out that it's alright for phing. The problem was that my IDE replaced the character transparently with its \uXXX equivalent. If you are having the same problem, try opening the file with a simple text editor.
