I have a static application (html/js) that I must put online in two different languages. While it's not hard to copy the whole directory twice and replace each string, it's kind of boring and I expect to see regular changes on this project.
Thus I considered building the two versions of the project (fr and nl) with phing. The idea would be to use a <filterchain><expandproperties />, and load the translations from a properties file.
It works quite well, but for one thing:
Unicode characters are represented as \uXXXX escapes, which, obviously, is not what I want...
Any idea how I can fix this? An excerpt of the build.xml can be found here: http://pastebin.com/2uWHaHvi
SOLUTION: It turns out that phing handles this fine. The problem was that my IDE transparently replaced each character with its \uXXXX equivalent. If you are having the same problem, try opening the file with a simple text editor.
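If a properties file has already been saved with the escapes in place, they can be turned back into real characters. The following is only a minimal Python sketch of that idea, not part of the phing build; the function name decode_unicode_escapes is hypothetical, and it assumes each escape is exactly four hex digits (surrogate pairs for characters outside the Basic Multilingual Plane are not recombined).

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text: str) -> str:
    """Replace every \\uXXXX escape in text with the character it names."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


if __name__ == "__main__":
    # A line as an IDE might have written it into the properties file.
    print(decode_unicode_escapes("language=Fran\\u00e7ais"))
```

Text without escapes passes through unchanged, so running it twice on the same file is harmless.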
Related
I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.
It'll usually look something like this in an HTML file in a browser:
this is the beginning of the file, ��
In the file itself, it'll appear like this:
this is the beginning of the file, xE2xA0
I tried using a regex editor to remove it, but to no avail: it cannot find the sequence at all. How can I remove this garbage data? Again, some of the files have all kinds of HTML markup.
Thank you for any help.
Those appear because something is wrong with a character set on your site.
For example, your files are stored in Unicode, but your Content-Type is set as text/html; charset=ISO-8859-1. The problem could also be how text is stored in your database, or with how text is represented internally by your programming language.
Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.
You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.
I recommend using command line tools like file to examine what character set a text file is stored in and iconv to convert text files from one character set to another.
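To make the mismatch concrete, here is a small Python sketch (an illustration only; the helper name reencode is hypothetical). It shows how UTF-8 bytes turn into garbage when read as ISO-8859-1, and how a conversion in the spirit of iconv re-encodes text once the real source character set is known.

```python
def reencode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode data from the source charset and encode it in the target one."""
    return data.decode(source).encode(target)


# Text stored as UTF-8 but declared as ISO-8859-1: each multi-byte
# character is displayed as two or more unrelated characters.
utf8_bytes = "café".encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")

# A file that really is ISO-8859-1, converted to UTF-8.
latin1_bytes = "café".encode("iso-8859-1")
converted = reencode(latin1_bytes, "iso-8859-1")

if __name__ == "__main__":
    print(misread)                    # garbled form of the word
    print(converted.decode("utf-8"))  # readable again
```

The conversion only helps when the source charset is identified correctly, which is why checking the file first matters.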
There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.
As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.
Is there any chance someone copied & pasted stuff from a Russian version of some software package?
Also, you can get & use iconv on Windows.
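Both readings of the bytes can be checked quickly. This is a hedged Python sketch: the third byte 0x80 is an assumption added only to complete a UTF-8 sequence, since the question shows just the first two bytes.

```python
raw = b"\xe2\xa0"

# Reading 1: the start of a three-byte UTF-8 sequence. With an assumed
# third byte of 0x80 it decodes to U+2800, the first Braille pattern.
braille = (raw + b"\x80").decode("utf-8")

# On their own the two bytes are an incomplete sequence, so a decoder
# substitutes U+FFFD, the replacement character browsers show as garbage.
incomplete = raw.decode("utf-8", errors="replace")

# Reading 2: two single-byte characters in ISO-8859-5, a Cyrillic
# letter followed by a no-break space.
cyrillic = raw.decode("iso-8859-5")

if __name__ == "__main__":
    print(hex(ord(braille)))
    print([hex(ord(ch)) for ch in cyrillic])
```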
Sometimes text on my pages looks very strange, real example:
trained&nbsp;professionals&nbsp;and&nbsp;paraprofessionals&nbsp;coming&nbsp;together
...While the parent div is quite narrow so the text is just sticking out of it.
And it looks quite strange, because &nbsp; actually represents a space.
So, I wonder if it's possible to make the browser treat these characters as actual spaces and break the line where necessary, without actually replacing them?
EDIT
Why is blind replacing a problem?
Because &nbsp; may be needed sometimes.
Consider the following example:
Ranks:<br>
Marshall<br>
Leutenant<br>
Sergeant
If I just use a preg_replace on them, it would look different in the end.
(I would also consider suggestions for replacing them smartly (on the PHP platform), if you can think of some algorithm that wouldn't affect formatting.)
By definition, &nbsp; is a non-breaking space. Its very meaning is that the line must not be broken there. If this is not what you intend, then I suggest fixing the HTML instead of trying to force the browser into non-standard behaviour.
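One possible way to fix the HTML without touching indentation is to replace only those &nbsp; entities that sit directly between two word characters, and leave leading entities and runs of them alone. This is a rough Python sketch under that assumption (the question asks about PHP, where preg_replace accepts the same lookaround syntax); the function name soften_nbsp is hypothetical, and an entity next to punctuation is deliberately left untouched.

```python
import re

# A single &nbsp; with a word character immediately before and after it,
# i.e. one that is standing in for an ordinary space between two words.
_INTER_WORD_NBSP = re.compile(r"(?<=\w)&nbsp;(?=\w)")


def soften_nbsp(html: str) -> str:
    """Turn inter-word &nbsp; entities into breakable spaces."""
    return _INTER_WORD_NBSP.sub(" ", html)


if __name__ == "__main__":
    print(soften_nbsp("trained&nbsp;professionals&nbsp;coming&nbsp;together"))
    # Entities used for indentation follow a tag or another entity,
    # so they do not match and the layout is preserved.
    print(soften_nbsp("Ranks:<br>&nbsp;&nbsp;Marshall"))
```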
I have a function that sanitizes URLs and filenames, and it works fine with characters like éáßöäü: it replaces them with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese … languages? And if replacing is not possible because it's not easy to determine the replacement, how can I remove all those characters? Of course I could first sanitize it as above and then remove all "non-latin" characters. But maybe there is a better solution?
Edit/addition
As asked in the comments: what is the purpose of my question? We had a client whose content was at first in English, German and Russian. Later on, some Chinese pages were added. Two problems occurred with the URLs:
the first sanitizer killed all non-ASCII characters and could return blank (invalid) clean URLs
the client found that in some browsers clean URLs with Chinese characters wouldn't work
The first point led me to the idea of replacing those characters, which, as the comments confirmed, is not possible. Maybe somebody will now answer that in all modern browsers (starting with IE8) this isn't an issue anymore. I would be glad to hear about that too.
As for Japanese, as an example, there is usually a romaji representation of everything, which uses only ASCII characters and still gives a reversible and understandable representation of the original characters. However, translating something into romaji requires that you know the correct pronunciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard, if not impossible, to simply convert everything correctly (or at least not efficiently doable for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean, on the other hand, has a very simple character set which should be easily translatable into a roman representation. Another common problem, though, is that there is not a single romanization method; those languages usually have different ones which are used by different people (Japanese, for example, has two common romanizations).
So it really depends on the actual language you are working with. While you might be able to make it work for some languages, another problem would be to detect which language you are actually working with (e.g. Japanese and Chinese share a lot of characters, but meanings, pronunciations and therefore romanizations are usually incompatible). Especially for simple sanitization of file names, I don't think it is worth investing that amount of work and processing time.
Maybe you should work in a different direction: make your file names simply work as Unicode file names. There are actually very few characters that are truly invalid in file systems (*|\/:"<>?), so it would be far easier to filter those out and otherwise support Unicode file names.
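A filter like that is only a few lines. Here is a minimal Python sketch of the approach (the question uses PHP, where preg_replace with the same character class should behave the same way); the function name sanitize_filename and the fallback value are hypothetical, the fallback being an assumption added so that a name made up entirely of invalid characters does not come back empty.

```python
import re

# The characters listed above as invalid in file names: * | \ / : " < > ?
_INVALID = re.compile(r'[*|\\/:"<>?]')


def sanitize_filename(name: str, fallback: str = "unnamed") -> str:
    """Strip only the invalid characters and keep all other Unicode text."""
    cleaned = _INVALID.sub("", name).strip()
    return cleaned or fallback


if __name__ == "__main__":
    print(sanitize_filename("报告:2024?.txt"))
    print(sanitize_filename("???"))
```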
You could run it through your existing sanitizer first, and then convert anything that is still non-Latin to Punycode.
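As a rough illustration of the Punycode idea, here is a Python sketch using the punycode codec from the standard library. The function name to_ascii_slug is hypothetical, and prefixing the result with "xn--" is an assumption borrowed from the convention for internationalized domain names; for a URL path segment any marker, or none, would do.

```python
def to_ascii_slug(text: str) -> str:
    """Return text unchanged if it is ASCII, otherwise its Punycode form."""
    try:
        text.encode("ascii")
        return text
    except UnicodeEncodeError:
        # Punycode output is plain ASCII and can be decoded back losslessly.
        return "xn--" + text.encode("punycode").decode("ascii")


if __name__ == "__main__":
    print(to_ascii_slug("münchen"))
    print(to_ascii_slug("中文"))
```

The result is reversible but not readable, so it addresses the browser problem rather than the wish for human-friendly URLs.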
So, as I understand it, you need a character mapping table for every language, and you replace characters according to that table.
For example, to transliterate Russian characters into their Latin equivalents, we use such tables =) Or classes which use these tables =)
It's interesting, I just found this: http://derickrethans.nl/projects.html#translit
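A table-driven replacement of that kind can be sketched as follows. This is a Python illustration with a deliberately tiny, hypothetical table (RU_TO_LATIN covers only a handful of letters and follows no particular romanization standard); a real table would list the whole alphabet.

```python
# Hypothetical, incomplete Russian-to-Latin table, for illustration only.
RU_TO_LATIN = {
    "п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t",
    "ж": "zh", "ш": "sh", "щ": "shch", "ч": "ch",
}

# str.maketrans accepts single-character keys mapped to strings of any length.
_TABLE = str.maketrans(RU_TO_LATIN)


def transliterate(text: str) -> str:
    """Lower-case the text and replace every character found in the table."""
    return text.lower().translate(_TABLE)


if __name__ == "__main__":
    print(transliterate("Привет"))
```

Characters missing from the table pass through unchanged, so the result may still need a final non-ASCII filter.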
I am currently working on the localization of a website which was at first in English only. A third-party company did the translations and provided them to us in an Excel file, which I successfully converted to a PHP array that I can use in my views. I'm using Eclipse for Windows to edit my PHP files.
All is fine, except that I need to add variables in my strings, e.g.:
'%1 is now following %2'
In Arabic I was provided with strings like this one:
'_______الآن يتتبع _______'
I find that replacing the underscores with %1 and %2 is incredibly difficult, because the Arabic part is a right-to-left string, and the %1 and %2 will be treated as either left-to-right or right-to-left, and I'm not sure which. I rarely get the parameter order I expect, because %1 will sometimes go to the left of the string and sometimes to the right, depending on where I start to type. Copy-pasting the replacement strings can have the same strange effects.
Most of the times I end up with a string like this one:
%2الآن يتتبع %1
The %1 should be on the right-hand side, the %2 on the left-hand side. The %1 is obviously treated as a right-to-left string, because the % appears on the right. The %2 is treated as left-to-right.
I'm sure someone has had this issue before. Is there any way it can be done easily in Eclipse? Or with an editor that is smarter about Arabic? Or maybe it is a Windows issue? Is there a workaround?
UPDATE
I also tried splitting my string into multiple strings, but this also changes the order of the parameters:
'%1' . 'الآن تتبع' . '%2'
UPDATE 2
It seems that changing the replacement string makes things better. It is probably linked to how numbers are handled in Arabic strings. This string was edited in Eclipse without any problem. The order of the parameter is correct, the string is handled correctly by PHP:
'{var2} الآن يتتبع {var1}'
If no other solution is found, this could be a good alternative.
Being an Arabic speaker, I get lots of localization tasks. Although I haven't faced this particular problem, I've had many left-to-right/right-to-left issues while editing. I've had success working with Notepad++.
So here's what I usually do when I want to edit Arabic text:
Open empty Notepad++ *
Set encoding to UTF-8 (Encoding -> Encoding in UTF-8)
Enable RTL mode (View -> Text Direction RTL)
Paste your strings
And here's a screenshot showing how I'm editing your string
*: for some reason, whenever I open an already existing file things go bananas. So maybe I'm being superstitious, but this has always worked for me.
Update: First time I did this I was skeptical because the strings looked wrong, but then I did this:
print_r(str_split($string));
and I saw that they're indeed in the correct order.
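The point of that check is that a string's stored (logical) order can differ from the order an editor draws on screen. The same check can be sketched in Python; the helper name logical_order is hypothetical, and the sample string is assembled from pieces so that the editor's bidirectional display cannot affect what ends up in the string.

```python
def logical_order(text: str, markers=("%1", "%2")) -> list:
    """Return the markers sorted by where they occur in the stored string."""
    return sorted(markers, key=text.index)


# Built by concatenation: %1 is stored first and %2 last, however the
# line happens to be rendered.
sample = "%1" + " الآن يتتبع " + "%2"

if __name__ == "__main__":
    print(logical_order(sample))
    print(list(sample)[:2])  # the first stored characters
```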
@Adnan helped me realize, and later confirmed, that there are issues when mixing Latin digits with Arabic text.
Based on that conclusion, the solution is simply to stop using %1, %2, %3, ... as placeholders. I will use more descriptive keywords instead, for example {USER}, {ALBUM}, {PHOTO}, ...
This shows the expected result in the PHP file and it is easily editable:
'ar' => '{USER} الآن يتابع {ALBUM}'
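Substituting such named placeholders is straightforward. Here is a minimal Python sketch (the project itself is PHP, where str_replace would do the same job); the function name fill and the sample values are hypothetical.

```python
def fill(template: str, **values: str) -> str:
    """Replace each {NAME} placeholder with the value passed for NAME."""
    for name, value in values.items():
        template = template.replace("{" + name + "}", value)
    return template


if __name__ == "__main__":
    message = fill("{USER} الآن يتابع {ALBUM}", USER="Sara", ALBUM="Summer")
    print(message)
```

Because the keywords contain no digits, they are less likely to be reordered on screen while the translation is being edited.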
I would prefer the original Notepad for this kind of task.
Open Notepad, make sure you're in LTR mode
Type %1
Change mode to RTL by pressing CTRL + SHIFT
Paste the Arabic string into the editor.
Revert back to LTR by pressing CTRL + SHIFT again.
Type %2
Select all with CTRL + A and copy with CTRL + C
Paste into the IDE. It should look weird but execute as expected.
Reason for using Notepad: more complex editors such as Notepad++, Sublime, Coda (Mac) and some IDEs (in your case Eclipse) may not use the correct encoding, whereas Notepad is simple yet works well for multilingual tasks.
I'm stuck on a crazy project that has me looking for a strange solution. I've got an XFA PDF document generated by an outside party. There are several checkmark characters '✓' in the PDFs that I need to simply change to 'X'. The reason for this is beyond my control. I'm just looking for a way to change the ✓'s into X's. Can anyone point me in the right direction? Is it possible?
Currently we use PHP and TCPDF for creating "our" server PDFs, but this particular PDF is generated outside of my control by a third party that doesn't want to alter their way of doing things. To make things worse, I don't know how many checkmarks there are or where they may appear. It's just one very specific character that needs changing. Does anyone know a way of hacking the document to change the character?
Character 2713
http://www.fileformat.info/info/unicode/char/2713/index.htm
Yes, I think you can. To my (rather limited) knowledge of the PDF format, you can only reliably search and replace strings of one character in length, since text is created by placing strings of variable length at specific co-ordinates, in an arbitrary order. The string 'hello' could therefore be one string of five letters, or five strings of one letter each, or some combination thereof, all placed in the correct position (and in whatever order the print driver decided upon).
I'm afraid I don't know of any libraries that will do this, but I'd be surprised if they don't exist. You'll need to read PDF objects in, do the replacement, and write them out to a new file. I'd start off researching around the answers to this question.
Edit: this looks like it might be useful.