Encoding dilemma: %C3%A9 vs e%CC%81 (PHP)

If you look at the two links below, they look the same, but one of them works and the other does not. After analysing the problem, it seems there is a difference in how the "é" character (and every other accented character) is interpreted: the encoder treats it either as one character or as the unaccented letter followed by a separate accent character.
This problem breaks the images on the website, even though the files are present on the FTP server.
The question is how to fix that: is the fix in WordPress, the database, or the server?
Thanks, and sorry for my poor English.
http://r20med.regions20.org/wp-content/uploads/2016/07/Portes-Ouvertes-sur-le-tri-sélectif-à-Hai-Essabah-Oran_017.jpg
http://r20med.regions20.org/wp-content/uploads/2016/07/Portes-Ouvertes-sur-le-tri-sélectif-à-Hai-Essabah-Oran_017.jpg

If you look at the screen capture below (taken from a text editor after copy-pasting your links above), you can see the difference.
The easiest solution I found was to change the filenames in an editor that only uses the accentuation you can see in the first row.
Conclusion: never trust uploads! Always check them for everything that's around! :) (This is true for filenames, texts, HTML, etc. - even if they are not intended to be harmful, they may block functions of your website / app and cause other problems!)
Notes:
Not all text editors show them as different, so choose one that does!
If possible, get rid of accents in filenames (for uploads, use a function to sanitise them, or even turn them into a slug, as WordPress does with sanitize_title() or sanitize_title_with_dashes()).
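A minimal sketch of the difference between the two links, assuming PHP 7 or later. The normalisation step relies on the optional intl extension, and to_nfc is a hypothetical helper name, not a WordPress function:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: return the name in NFC (composed) form when the
// intl extension is available, otherwise return it unchanged.
function to_nfc(string $name): string
{
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($name, Normalizer::FORM_C);
        return $normalized === false ? $name : $normalized;
    }
    return $name;
}

// Two spellings that look identical on screen.
$composed   = "s\u{00E9}lectif";  // "é" as the single code point U+00E9
$decomposed = "se\u{0301}lectif"; // "e" followed by combining accent U+0301

echo rawurlencode($composed), "\n";   // s%C3%A9lectif
echo rawurlencode($decomposed), "\n"; // se%CC%81lectif
```

Normalising file names to one form at upload time (and renaming the files already on disk to the same form) is one way to make both the URL and the stored name agree.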

Related

How to remove question mark garbage data, dynamically, from files?

I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.
It'll usually look something like this in an HTML file in a browser:
this is the beginning of the file, ��
In the file itself, it appears like this:
this is the beginning of the file, xE2xA0
I tried using a regex editor to remove it, but to no avail: it cannot find it at all. How can I remove this garbage data? Again, some of the files have all kinds of HTML markup.
Thank you for any help.

Those appear because something is wrong with a character set on your site.
For example, your files are stored in Unicode, but your Content-Type is set as text/html; charset=ISO-8859-1. The problem could also be how text is stored in your database, or with how text is represented internally by your programming language.
Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.
You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.
I recommend using command line tools like file to examine what character set a text file is stored in and iconv to convert text files from one character set to another.
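If the conversion has to happen inside PHP rather than on the command line, a rough equivalent could look like the sketch below. The helper name ensure_utf8 and the default source encoding are assumptions; PHP cannot reliably guess the original encoding from the bytes alone, so you have to supply it.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: leave well-formed UTF-8 alone, otherwise convert
// from an encoding the caller assumes the bytes are in.
function ensure_utf8(string $bytes, string $assumedSource = 'ISO-8859-1'): string
{
    // preg_match() with the /u modifier returns 1 only for well-formed UTF-8.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }
    $converted = iconv($assumedSource, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $assumedSource");
    }
    return $converted;
}

// "café" stored as ISO-8859-1 bytes becomes UTF-8.
echo bin2hex(ensure_utf8("caf\xE9")), "\n"; // 636166c3a9
```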

There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.
As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.
Is there any chance someone copied & pasted stuff from a Russian version of some software package?
Also, you can get & use iconv on Windows.

How to force line-breaks on &nbsp;?

Sometimes text on my pages looks very strange, real example:
trained professionals and paraprofessionals coming together
...while the parent div is quite narrow, so the text just sticks out of it.
And it looks quite strange, because &nbsp; actually represents a space.
So, I wonder if it's possible to make the browser treat these characters as actual spaces and break the line where necessary, without actually replacing them?
EDIT
Why is blind replacement a problem?
Because &nbsp; may be needed sometimes.
Consider the following example:
Ranks:<br>
Marshall<br>
Leutenant<br>
Sergeant
If I just ran preg_replace on them, the result would look different.
(I would also welcome suggestions for replacing them smartly on the PHP platform, if you can think of an algorithm that wouldn't affect formatting.)

By definition, &nbsp; is a non-breaking space. Its very purpose is to prevent a line break at that position. If this is not what you intend, then I suggest fixing the HTML instead of trying to force the browser into non-standard behaviour.
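If fixing the HTML means post-processing it in PHP, one possible heuristic is to replace only an isolated &nbsp; that sits directly between two letters or digits, and to leave runs of &nbsp; and line-leading ones alone so that indentation survives. This is only a sketch; soften_nbsp is a hypothetical helper, and the rule may not match every layout:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: turn a single &nbsp; into a normal space only when
// it is directly preceded and followed by a letter or digit.
function soften_nbsp(string $html): string
{
    $result = preg_replace('/(?<=[\p{L}\p{N}])&nbsp;(?=[\p{L}\p{N}])/u', ' ', $html);
    return $result ?? $html; // preg_replace() returns null on invalid UTF-8
}

echo soften_nbsp('trained&nbsp;professionals&nbsp;and'), "\n";
echo soften_nbsp("Ranks:<br>\n&nbsp;&nbsp;Marshall<br>"), "\n";
```

The first call yields ordinary spaces that the browser may break on; the second is returned unchanged because neither entity sits between two word characters.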

(?) marks in HTML. Encoding issue from content from the database?

Any idea why this is happening?
It looks to be happening mainly with apostrophes and hyphens. Any ideas if I can fix this? I pull the data from my database and print it to the page like:
<div class="block">
<?=$details['agenda'] ?>
</div>

As other commenters may have mentioned, this is a character encoding problem. If you're lucky, you can force your HTML page to render in UTF-8 and that will resolve it.
Unfortunately, if you're not lucky, you'll discover that the characters are stored in the database in the wrong encoding. Or maybe the database converts them. Or maybe the character encoding data has been destroyed along the path! There's no way of knowing in advance where those characters have been damaged.
The best way I know to fix problems like this is to force every step along your path to follow UTF-8 content encoding. For example, you probably go through steps like this:
Content author writes a document in Microsoft Word containing "SmartQuotes"
Content author copies-and-pastes into the edit box of a content management system.
Content management system saves to the database.
The database may or may not store data in Unicode internally, so make sure you use nvarchar (or whatever Unicode type your database supports).
Code that reads from the database may need to scan for damaged characters.
However, it's very tricky to fix this! A long time ago, I used to have a habit of writing "detect-and-fix" routines like this:
$smartquotes = array("”", "“");
// str_replace() returns the result; it does not modify $mytext in place
$mytext = str_replace($smartquotes, '"', $mytext);
Of course you know what the problem is - I'd keep discovering new characters I had to fix. Microsoft Word likes to do tons of unusual characters - copyright, registration marks, apostrophes, hyphens, and so on. I'd keep adding to this function, over and over, until I went crazy. So nowadays I just go through my entire content delivery path and force everything to obey UTF-8 rules; that seems to resolve it in most cases.
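As a rough illustration of forcing UTF-8 at each step in PHP, here is a sketch. The database lines are shown as comments only, because the connection details are assumptions, and render_block is a hypothetical helper that mirrors the markup from the question:

```php
<?php
declare(strict_types=1);

// 1. Declare the encoding of the response (skipped if output already started).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// 2. Ask the database connection for UTF-8 as well. Assumed MySQL setup:
//    $db = new mysqli($host, $user, $pass, $name);
//    $db->set_charset('utf8mb4');

// 3. Escape on output with an explicit charset.
function render_block(string $text): string
{
    return '<div class="block">'
        . htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8')
        . '</div>';
}

// Curly apostrophe (U+2019) and en dash (U+2013) pass through untouched.
echo render_block("It\u{2019}s 9\u{2013}5"), "\n";
```

Escaping with htmlspecialchars assumes the stored text is plain text; if the agenda field intentionally contains HTML, that step would need to be different.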
Good luck!

Handle arabic string in PHP with Eclipse

I am currently working on the localization of a website that was originally in English only. A third-party company did the translations and provided us with an Excel file, which I successfully converted to a PHP array that I can use in my views. I'm using Eclipse for Windows to edit my PHP files.
All is fine, except that I need to add variables in my strings, e.g.:
'%1 is now following %2'
In Arabic I was provided with strings like this one:
'_______الآن يتتبع _______'
I find that replacing the underscores with %1 and %2 is incredibly difficult, because the Arabic part is a right-to-left string and the %1 and %2 will be treated as left-to-right or right-to-left, and I can't tell which. I rarely get the parameter order I expect, because %1 sometimes goes to the left of the string and sometimes to the right, depending on where I start typing. Copy-pasting the replacement strings can have the same strange effects.
Most of the times I end up with a string like this one:
%2الآن يتتبع %1
The %1 should be on the right-hand side and the %2 on the left-hand side. The %1 is obviously treated as a right-to-left string, because the % appears on its right. The %2 is treated as left-to-right.
I'm sure someone has had this issue before. Is there an easy way to do this in Eclipse? Or with an editor that handles Arabic better? Or is it a Windows issue? Is there a workaround?
UPDATE
I also tried splitting my string into multiple strings, but this also changes the order of the parameters:
'%1' . 'الآن تتبع' . '%2'
UPDATE 2
It seems that changing the replacement string makes things better. It is probably linked to how numbers are handled in Arabic strings. This string was edited in Eclipse without any problem. The order of the parameter is correct, the string is handled correctly by PHP:
'{var2} الآن يتتبع {var1}'
If no other solution is found, this could be a good alternative.

As an Arabic speaker, I get lots of localization tasks. Although I haven't faced this particular problem, I've had many left-to-right/right-to-left issues while editing. I've had success working with Notepad++.
So here's what I usually do when I want to edit Arabic text:
Open empty Notepad++ *
Set encoding to UTF-8 (Encoding -> Encoding in UTF-8)
Enable RTL mode (View -> Text Direction RTL)
Paste your strings
And here's a screenshot showing how I'm editing your string
*: for some reason, whenever I open an already existing file things go bananas. So maybe I'm being superstitious, but this has always worked for me.
Update: First time I did this I was skeptical because the strings looked wrong, but then I did this:
print_r(str_split($string));
and I saw that they're indeed in the correct order.
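One caveat: str_split works on bytes, so each Arabic letter shows up as two separate bytes in that output. A character-level variant, as a sketch (logical_chars is a hypothetical helper, and the sample string is an assumption built from escapes so that no editor can reorder it):

```php
<?php
declare(strict_types=1);

// Hypothetical helper: split a UTF-8 string into characters in logical
// (stored) order, independent of how an editor displays it.
function logical_chars(string $text): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? [] : $chars;
}

// "%1", a space, four Arabic letters, a space, "%2".
$string = "%1 \u{0627}\u{0644}\u{0622}\u{0646} %2";
print_r(logical_chars($string));
```

The array lists the characters in the order PHP will process them, which is what matters for the placeholders.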

@Adnan helped me realize, and later confirmed, that there are issues when mixing Latin numbers with Arabic text.
Based on that conclusion, the solution is simply to stop using %1, %2, %3, ... as placeholders. I will be using more descriptive keywords instead, for example {USER}, {ALBUM}, {PHOTO}, ...
This shows the expected result in the PHP file and it is easily editable:
'ar' => '{USER} الآن يتابع {ALBUM}'
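Filling such named placeholders could be done with strtr, for example. This is a sketch: the $messages table, the 'en' entry and the fill_placeholders helper are assumptions for illustration, and only the 'ar' string comes from the line above.

```php
<?php
declare(strict_types=1);

// Hypothetical translation table.
$messages = [
    'en' => '{USER} is now following {ALBUM}',
    'ar' => '{USER} الآن يتابع {ALBUM}',
];

// Hypothetical helper: replace every {NAME} placeholder in one pass.
function fill_placeholders(string $template, array $values): string
{
    $pairs = [];
    foreach ($values as $name => $value) {
        $pairs['{' . $name . '}'] = (string) $value;
    }
    return strtr($template, $pairs);
}

$values = ['USER' => 'Adnan', 'ALBUM' => 'Holidays'];
echo fill_placeholders($messages['en'], $values), "\n";
echo fill_placeholders($messages['ar'], $values), "\n";
```

Because strtr replaces all pairs in a single pass, a value that happens to contain another placeholder name is not substituted a second time.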

I would prefer the original Notepad for this kind of task.
Open Notepad, make sure you're in LTR mode
Type %1
Change mode to RTL by pressing CTRL + SHIFT
Paste the Arabic string into the editor.
Revert back to LTR by pressing CTRL + SHIFT again.
Type %2
Select all with CTRL + A and copy with CTRL + C
Paste into the IDE. It should look weird but execute as expected.
Reason for using Notepad: more complex editors such as Notepad++, Sublime, Coda (Mac), and some IDEs (in your case, Eclipse) may not use the correct encoding, while Notepad is simple yet works well for multilingual tasks.

Removing characters from a PHP String

I'm accepting a string from a feed for display on the screen that may or may not include some rubbish I want to filter out. I don't want to filter normal symbols at all.
The values I want to remove look like this: �
It is only this that I want removed. Relevant technology is PHP.
Suggestions appreciated.

This is an encoding problem; you shouldn't try to clean up those bogus characters, but rather understand why you're receiving them scrambled.
Try to get your data as Unicode, or agree with your feed provider that you will both use the same encoding.

Thanks for the responses, guys. Unfortunately, those submitted had the following problems:
This is wrong for obvious reasons:
ereg_replace("[^A-Za-z0-9]", "", $string);
This:
s/[\u00FF-\uFFFF]//
which also relies on the deprecated ereg form of regex, didn't work either once I converted it to preg, because the range was simply too large for the regex to handle. Also, there are holes in that range that would allow rubbish to seep through.
This suggestion:
This is an encoding problem; you shouldn't try to clean up those bogus characters, but rather understand why you're receiving them scrambled.
while valid, is no good because I have no control over how the data I receive is encoded. It comes from an external source. Sometimes there's garbage in there and sometimes there isn't.
So, the solution I came up with was relatively dirty, but in the absence of something more robust I'm just accepting all standard letters, numbers and symbols and discarding the rest.
This does seem to work for now. The solution is as follows:
$fixT = str_replace("£", "£", $string);
$fixT = str_replace("€", "€", $fixT);
$fixT = preg_replace("/[^a-zA-Z0-9\s\.\/:!\[\]\*\+\-\|\<\>@#\$%\^&\(\)_=\';,'\?\\\{\}`~\"]/", "", $fixT);
If anyone has any better ideas I'm still keen to hear them. Cheers.

You are looking for characters that are outside the range of glyphs that your font can display. You can find the maximum Unicode value that your font can display, and then create a regex that replaces anything above that value with an empty string. An example would be
s/[\u00FF-\uFFFF]//
This would strip everything from character 255 (U+00FF) up to U+FFFF.
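In PHP's preg functions the same range has to be written differently: code points are spelled \x{...} and the pattern needs the /u modifier. A sketch, assuming the input is valid UTF-8 (strip_from_u00ff is a hypothetical helper name):

```php
<?php
declare(strict_types=1);

// Hypothetical helper: remove every code point from U+00FF to U+FFFF.
function strip_from_u00ff(string $text): string
{
    $result = preg_replace('/[\x{00FF}-\x{FFFF}]/u', '', $text);
    return $result ?? $text; // null means the input was not valid UTF-8
}

// "é" (U+00E9) is below the range and survives; the euro sign (U+20AC)
// and the replacement character (U+FFFD) are removed.
echo strip_from_u00ff("a\u{00E9}\u{20AC}\u{FFFD}b"), "\n";
```

Note that this also removes legitimate characters in that range, which is the "holes" problem in reverse.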

That's going to be difficult for you to do, since you don't have a solid definition of what to filter and what to keep. Typically, characters that show up as empty squares are anything that the typeface you're using doesn't have a glyph for, so the definition of "stuff that shows up like this: �" is horribly inexact.
It would be much better for you to decide exactly what characters are valid (this is always a good approach anyway, with any kind of data cleanup) and discard everything that is not one of those. The PHP filter function is one possibility to do this, depending on the level of complexity and robustness you require.
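One possible definition of "valid", sketched below, is "any well-formed UTF-8 sequence except the replacement character U+FFFD". That definition is an assumption, not something the feed guarantees, and drop_rubbish is a hypothetical helper name:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: keep only well-formed UTF-8 sequences, then drop any
// literal U+FFFD replacement characters that were already in the data.
function drop_rubbish(string $bytes): string
{
    // Byte-level pattern (no /u modifier) listing the well-formed UTF-8 forms.
    $validSequence = '/[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}/';
    preg_match_all($validSequence, $bytes, $matches);
    return str_replace("\u{FFFD}", '', implode('', $matches[0]));
}

// A truncated sequence (E2 A0) and a U+FFFD are removed; the pound sign stays.
echo drop_rubbish("a\xE2\xA0b\u{FFFD}c\u{00A3}"), "\n";
```

Unlike an ASCII whitelist, this keeps symbols such as the pound and euro signs without listing them one by one.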

If you can't resolve the issue with the data from the feed and need to filter the information, then this may help:
PHP 5's filter_input is very good for filtering input strings and allows a fair amount of flexibility:
filter_input(input_type, variable, filter, options)
You can also filter all of your form data in one line if it requires the same filtering :)
There are some good examples and more information about it here:
http://www.w3schools.com/PHP/func_filter_input.asp
The PHP site has more information on the options here: Validation Filters
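Since filter_input reads from request data (INPUT_GET, INPUT_POST and so on), a string that is already in a variable, such as feed content, would go through filter_var instead. A blunt sketch that strips every byte above 127 (strip_high_bytes is a hypothetical helper name):

```php
<?php
declare(strict_types=1);

// Hypothetical helper: keep the string as-is except for bytes above 127.
function strip_high_bytes(string $text): string
{
    $filtered = filter_var($text, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
    return $filtered === false ? '' : $filtered;
}

echo strip_high_bytes("price: \u{FFFD}5"), "\n"; // price: 5
```

This also removes every legitimate non-ASCII character, including accented letters and currency symbols, so it only fits feeds that are expected to be plain ASCII.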

Take a look at this question to get the value of each byte in your string. (This assumes that multibyte overloading is turned off.)
Once you have the bytes, you can use them to determine what these "rubbish" characters actually are. It's possible that they're a result of misinterpreting the encoding of the string, or displaying it in the wrong font, or something else. Post them here and people can help you further.
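A quick way to see the bytes, as a sketch (hex_bytes is a hypothetical helper name):

```php
<?php
declare(strict_types=1);

// Hypothetical helper: show each byte of a string as two hex digits.
function hex_bytes(string $bytes): string
{
    return rtrim(chunk_split(bin2hex($bytes), 2, ' '));
}

// The Unicode replacement character U+FFFD, encoded as UTF-8.
echo hex_bytes("\u{FFFD}"), "\n"; // ef bf bd
```

If the dump shows ef bf bd, the data already contains a literal replacement character and the original byte was lost upstream; anything else points to bytes in some other encoding.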

Try this:
Download a sample from the feed manually.
Open it in Notepad++ or another advanced text editor (KATE on Linux is good for this).
Try changing the encoding and converting from one encoding to another.
If you find a setting that makes the characters display properly, then you'll need to either encode your site in that encoding, or convert it from that encoding to whatever you use on your site.

Hello Friends,
Try this regular expression to remove Unicode escape sequences (literal \uXXXX text) from the string:
/\\u[0-9a-fA-F]{4}/
Thanks,
Chintu(prajapati.chintu.001#gmail.com)
