MediaWiki Special Characters in file name issue

When I upload a file to MediaWiki with special characters in its name, the filename gets mangled into something like this:
Waterford_(DÃ¡il_Ã‰ireann_constituency).png
when it is written to disk, whereas it should look like this:
Waterford_(Dáil_Éireann_constituency).png
As a result, when another page links to that file, it shows up as a broken image link, because the page is looking for
http://mysite.com/wiki/images/Waterford_(Dáil_Éireann_constituency).png
I don't want to prevent people from using special characters, as they often copy files from Wikipedia, which supports them, and I think the problem has something to do with the way my host handles files.
So it would be preferable to intercept the way MediaWiki creates the files, so that the file path is free from special characters while all the references still work.

MediaWiki is 100% UTF-8 safe, but something in your Apache/PHP configuration is misinterpreting the UTF-8 bytes as ISO 8859-1 (Latin-1). Start by making sure your PHP install has mbstring enabled, as specified here:
http://www.mediawiki.org/wiki/PHP_configuration
If you're stuck with a messed-up host, then this Talk page has some clues for stabbing your Wiki with a knife to make it dumb down filenames:
http://www.mediawiki.org/wiki/Manual_talk:Configuring_file_uploads#Image_File_Names
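To see why the name ends up looking the way it does, the mangling can be reproduced outside MediaWiki. The following is only an illustrative Python sketch, not anything MediaWiki or PHP actually runs. It assumes the bytes were misread as Windows-1252 (the "‰" in the mangled name suggests that rather than strict Latin-1, where that byte is a control character).

```python
# Illustrative only: reproduces the mangling from the question in Python.
original = "Waterford_(Dáil_Éireann_constituency).png"

# The name is stored as UTF-8 bytes...
utf8_bytes = original.encode("utf-8")

# ...but something reads those bytes as Windows-1252 (an assumption here).
mangled = utf8_bytes.decode("cp1252")
print(mangled)

# Reversing the mistake recovers the original, provided no bytes were lost.
repaired = mangled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

If a round trip like this restores your real filenames, that points to a single misinterpretation step somewhere between upload and disk.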

I can't speak to how and where to do this in MediaWiki, but a good way would be to urlencode() the file name. Percent-encoded file names should work on all platforms, and they are easy to restore to their proper form when necessary.
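As a rough illustration of the idea, here is the same round trip in Python, using urllib.parse.quote and unquote as stand-ins for PHP's urlencode() and urldecode(). The two are not identical (PHP's urlencode() turns spaces into "+", for instance), so treat this as a sketch of the concept rather than a drop-in replacement.

```python
from urllib.parse import quote, unquote

name = "Waterford_(Dáil_Éireann_constituency).png"

# Percent-encode the UTF-8 bytes of the name; safe="" also encodes "/".
encoded = quote(name, safe="")
print(encoded)

# Decoding restores the original name exactly.
decoded = unquote(encoded)
print(decoded == name)
```

The encoded form contains only ASCII characters, which is why it survives a host that mishandles non-ASCII file names.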

Related

Encoding of Files for PHP project

I am developing a PHP project which uses HTML5. The following meta tag is used for all pages in my website.
<meta charset="utf-8">
I am coding on a Windows machine using NetBeans. I was not really aware of the encoding of the files. Since the code was working fine, I was not giving this any importance.
However, based on some of the questions on Stack Overflow, I came to understand more about encoding. I noticed that many PHP/JS/CSS files of my project are saved in UTF-8 encoding, whereas some PHP/JS/CSS files are saved in ANSI encoding. (To find this out, I opened the file in Notepad, clicked on save as and checked the default encoding shown.)
It seems the files in which I pasted some Unicode characters were automatically saved in UTF-8, and all other files were saved in ANSI encoding (I guess it might be Windows-1252). All this happened even though I set the project preference to UTF-8 in NetBeans.
Is it required to save those files (the ones which do not use Unicode) as UTF-8 too, since my HTML meta tag says UTF-8? (Note that there were no issues when I tested my website, but my testing was from a Windows machine.)
I am also curious to know how the browser renders the web page correctly even though some of the PHP files are saved in ANSI but served with the UTF-8 meta tag.

"(to understand this, i opened the file in notepad, clicked on save as and checked the default encoding shown)."
This isn't an accurate way of checking the encoding of a file.
Files which contain only ASCII characters (like most CSS and JavaScript source files!) are valid in most text encodings. Notepad will call them "ANSI" because that's its default, but they're also perfectly valid as UTF-8. No conversion is necessary.
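A more reliable check than Notepad's save dialog is to look at the bytes themselves. The sketch below is in Python purely for illustration; the helper name is_ascii_only and the sample contents are made up for the example.

```python
def is_ascii_only(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range."""
    return all(byte < 0x80 for byte in data)

css = b"body { margin: 0; }\n"
accented = "café".encode("cp1252")  # contains the single byte 0xE9

# ASCII-only bytes decode to the same text under all three encodings.
same = css.decode("utf-8") == css.decode("cp1252") == css.decode("latin-1")
print(is_ascii_only(css), same)

# Bytes saved as Windows-1252 with an accented letter are not valid UTF-8.
try:
    accented.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(is_ascii_only(accented), valid_utf8)
```

Only files for which the ASCII check fails need a closer look; the rest are already valid UTF-8 as they stand.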

Moved WordPress: broken uploaded files with accents

I moved my WordPress website from my old server to my new one. Everything works, except that every uploaded file (a PDF, for example) with accents (é, à, and so on) in the file name has a broken URL. So I would have to rename every file with accents, and change the links that point to these files.
The images are showing, though. It really seems to affect only files with accents in the filename.
Any solutions for this? I just wanted to move the WordPress website to my new server. It worked, but I would have to rename all these uploaded files.
Thanks!

I had a similar problem. I don't have a 100% correct answer, but I want to help.
I remember the problem was with the FTP program (I don't remember which one, maybe WinSCP?).
Anyway, that program did not coordinate the filename encoding with the server and changed the accented characters. You can try doing the transfer with another program, or from the server shell.
Second, we knew which characters were wrong, so my friend wrote a plugin to replace them. But if your problem goes in the other direction (plain characters that need to become accented ones), this can't work...
Find out where your problem is; you can possibly fix it with a plugin like this: https://wordpress.org/plugins/clean-image-filenames/
One last important thing: check whether your database uses the correct encoding.
Sorry for my English ;)
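To give an idea of what replacing the characters can look like, here is a small Python sketch that reduces accented letters to their plain ASCII base letters. It is not taken from the plugin mentioned above, and the function name to_ascii_filename is made up for the example. Any links stored in the database would still have to be updated to match the new names.

```python
import unicodedata

def to_ascii_filename(name: str) -> str:
    """Strip accents so the file name contains only ASCII characters."""
    # Split each accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Discard anything that still is not ASCII (letters with no decomposition).
    return stripped.encode("ascii", "ignore").decode("ascii")

print(to_ascii_filename("résumé_à_jour.pdf"))
```

Letters that have no decomposition, such as "ø" or "ß", are simply dropped by this sketch, so a real tool would probably want an explicit replacement table for them.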

PHP file losing formatting after FTP upload

I am using WinSCP to transfer files to an FTP site. I have a situation currently where one specific file within a folder loses all of its formatting when it is uploaded causing the PHP file to no longer work.
All other PHP files within the folder work correctly when uploaded.
I can't understand why just one file could be affected in this way. Can anyone shed any light on the situation?

The file was probably transferred in ASCII mode, which can modify the line endings (and possibly the encoding) of the file.

As you have not stated what exactly you mean by "losing formatting", it's difficult to answer. Anyway:
As per src's answer, if the line endings change due to an ASCII/text mode transfer, the converted file can appear to have lost its formatting when opened in an editor that does not support the target line endings. That hardly explains why only one file is affected, though. WinSCP can technically choose a different transfer mode based on, for example, file size or modification timestamp, if configured to do so, but I doubt you did that. Also note that WinSCP defaults to binary transfer mode. It would help if you stated which transfer mode you actually use with WinSCP. The definitive source for this information is the WinSCP session log file. Sharing the relevant part of the log file would also help with the investigation.
Another possibility is that the affected source file was created with different line endings in the first place (for example, in a different editor from the one you usually use). In that case the problem would have nothing to do with the transfer mode or with WinSCP. The difference may only be revealed when you open the files on the remote side using a third editor that supports only one of the line ending formats.
In both of these cases, though, the file should still work in PHP, as PHP supports both Unix and Windows line endings. Possibly the source file has such a strange format that during the ASCII/text mode transfer the server got confused and converted the file incorrectly. But that's just a wild guess.
Again, we need more information to help you.
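One piece of information that is easy to collect is which line endings the affected file actually contains, compared with a file that works. The following Python sketch counts them; the function name count_line_endings and the sample bytes are made up for the example.

```python
def count_line_endings(data: bytes) -> dict:
    """Count Windows (CRLF), Unix (LF) and lone CR line endings."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,  # LF not preceded by CR
        "cr": data.count(b"\r") - crlf,  # CR not followed by LF
    }

sample = b"<?php\r\necho 'a';\necho 'b';\r?>"
print(count_line_endings(sample))
```

For a real file, read it with open(path, "rb") so that Python does not translate the line endings before they are counted. A file that mixes several styles, or that contains lone CRs, is a likely candidate for the one that breaks.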

Migrate web-pages from different char-sets to UTF-8

For the last few years I have used Notepad++ on Win XP SP2.
As I have just seen, Notepad++ is set to encode new files as "ANSI" in "Windows Format". So basically all files on my hard disk should be ANSI files, but I'm not sure.
Most .html files have a charset tag of "text/html; charset=iso-8859-1", but some have none.
As for other files, especially text files (for example keyword lists) that I stored with the Firefox XPCOM system, I don't know how they are currently encoded.
On Server-side I have Apache with PHP and MySql.
For Upload I used Filezilla.
Now the problem is: I want to use Japanese characters (or Arabic, etc.). This only partly works.
I can get my self-made Firefox application to consistently write and read UTF-8. But I can't check every time which encoding each of the old files uses.
Having just read Joel Spolsky's old article about UTF-8 strengthens my view that I simply have to switch as much of my whole system as possible to UTF-8.
Once I have it running that way locally on my hard disk, I could just re-upload everything to the server.
So: How do I get all my local files converted to UTF-8?
And: Is it possible at all to have Win XP SP2 use UTF-8 consistently everywhere? Or do I have to check with every program, or even worse with every file, that the right encoding is used?
How about files I get, for example, in e-mails or via a USB stick, or that I download in zip files? (Or a thousand other possibilities.)
Update:
1.-4. went OK so far. I tried first with BOM, but without seems to be better.
So on to 5.): I have to change something there too. As in 3.), I changed the charset in the HTML template file, and the text coming from the template is displayed correctly. But the text coming from MySQL/PHP currently shows the unknown-character sign in some places, i.e. where there should be umlauts (äöü).
I have changed all collations for text fields in the MySQL database via phpMyAdmin to "utf8_unicode_ci", but that didn't do the trick.
Is it a PHP issue, or do I only have to convert the data in the MySQL database once somehow?

The beauty of UTF-8 is that it's a superset of ASCII, so if your HTML and PHP files contain only plain ASCII characters (i.e. English and programming/HTML syntax), you don't need to convert them at all. You can leave most of your files unchanged.
Should you find a few exceptions that you want to convert manually, you can open them in Notepad++ and choose 'Encoding' - 'Convert to UTF-8 (No BOM)'.
Yes, you do need to change or add the <meta> charset tag in all the HTML files to make sure the browser renders them as UTF-8.
In Notepad++ you can set new files to always be created as 'UTF-8 (No BOM), Unix'. Also tick "Apply to ANSI files" so that old files are correctly saved in the new encoding. I suggest the Unix format because, even though you are working on a Windows machine, web servers usually run Linux/BSD, so that format is the native one (keeping files in native form is important, especially when you are using a version control system).
Migrating a live site with a database is a different issue. Data in MySQL comes with its own encoding, and from your question I cannot tell whether you need to convert it or how. I need more specifics on that (if you need to).
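If there turn out to be more than a few files to convert by hand, the file conversion described above can also be scripted. The following Python sketch works on the raw bytes of one file. It assumes the legacy files are Windows-1252, which is only a guess based on the "ANSI" setting in the question, and the function name to_utf8_no_bom is made up for the example.

```python
import codecs

def to_utf8_no_bom(data: bytes, fallback: str = "cp1252") -> bytes:
    """Return the content as UTF-8 without a BOM.

    Content that is already valid UTF-8 (including pure ASCII) is left
    alone; anything else is assumed to be in the fallback encoding.
    """
    # Drop a leading UTF-8 byte order mark if there is one.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        text = data.decode(fallback)  # assumption: legacy Windows-1252
    return text.encode("utf-8")

legacy = "äöü".encode("cp1252")
print(to_utf8_no_bom(legacy) == "äöü".encode("utf-8"))
```

The encoding of a file cannot be detected with certainty, so it is worth spot-checking a few converted files (and keeping a backup) before re-uploading everything.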

How to remove excess whitespace added to code by FTP program?

This happens repeatedly and is very annoying. I upload some PHP code to a client's server. A few weeks pass. They ask for a change to be made and I re-download the code, as they've made some changes. However, my code, which was neat and tidy the last time I looked at it, now has extra lines of whitespace added everywhere. Where I had two blank lines between some code, there are now three. Where I had a bunch of lines sticking together because they were part of the same for loop or such, they're now all scattered around and there's no way to tell them apart.
Is there any program/utility to fix this?

Upload in binary mode instead of ASCII. ASCII mode is changing all your line feeds (the Unix end-of-line character) into carriage return + line feed pairs (the Windows end-of-line characters).
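For files that have already been damaged, the line endings can be normalized after the fact. Here is a minimal Python sketch, assuming the extra blank lines come from doubled or mixed line endings; the function name normalize_newlines is made up for the example.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR CR LF, CR LF and lone CR line endings to a single LF."""
    # Order matters: handle the longest sequence first.
    data = data.replace(b"\r\r\n", b"\n")  # result of a double conversion
    data = data.replace(b"\r\n", b"\n")    # Windows line endings
    data = data.replace(b"\r", b"\n")      # any remaining lone CR
    return data

damaged = b"for ($i = 0; $i < 3; $i++) {\r\r\n    echo $i;\r\n}\n"
print(normalize_newlines(damaged))
```

Read and write the file in binary mode when applying this, and transfer the result in binary mode so it is not converted again.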

You may also be having a problem with the other editor using tabs when you are using spaces (you are using spaces, right?). I have seen similar problems when sharing source between developers on Linux/OSX and Windows.

I'll bet this is caused by systems trying to convert text files created on a Windows system into files used by a Unix/Linux system (and back again).
Windows uses both a carriage return and a line feed, whereas Unix uses only a line feed.
I used UltraEdit as my main text editor (not that Emacs and vi don't rule :o) and it has a DOS mode and a Unix mode for just this kind of thing.

Force your client to transfer all files in binary mode.
This is also useful when you have Unicode text in your files; you never know what assumptions text mode might make!

At a guess, I would say that you are developing on a Unix / Mac system, but the customer is running on a Windows system (or vice versa). I would also guess that you are uploading / downloading your files in binary mode.
The Windows editor is probably converting the Unix LF to a CR/LF pair, which your editor then treats as two new lines.
If you upload and download in ASCII mode, the files will be automatically converted between the two formats.
