file_get_contents not working with non english filenames in DRUPAL - php

I have a problem.
file_get_contents() and other file functions (file(), fopen(), glob(), etc.) do not work when I try to open a file whose name contains non-English characters. I get an error saying the file does not exist. This happens when I call any of these functions from my simple Drupal module. Yet when I call file_get_contents() outside Drupal's code (in a separate, standalone PHP file), the function works as it should.
Can you advise something? What is Drupal doing that prevents me from using file functions on files with non-English names from my module?
Thanks.

Are you calling urlencode() on your filename? If not, you need to.

There is a Transliteration module, I believe it will help you a lot. Some more details about this module (from its project page):
Provides one-way string transliteration (romanization) and cleans file names during upload by replacing unwanted characters.
Generally spoken, it takes Unicode text and tries to represent it in US-ASCII characters (universally displayable, unaccented characters) by attempting to transliterate the pronunciation expressed by the text in some other writing system to Roman letters.
According to Unidecode, from which most of the transliteration data has been derived, "Russian and Greek seem to work passably. But it works quite bad on Japanese and Thai."

Related

How to remove question mark garbage data, dynamically, from files?

I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.
It'll usually look something like this in an HTML file in a browser:
this is the beginning of the file, ��
In the file itself, it appears like this:
this is the beginning of the file, xE2xA0
I tried using a regex editor to remove it, but to no avail: it cannot find the sequence at all. How can I remove this garbage data? Again, some of the files have all kinds of HTML markup.
Thank you for any help.
Those appear because something is wrong with a character set on your site.
For example, your files are stored in Unicode, but your Content-Type is set as text/html; charset=ISO-8859-1. The problem could also be how text is stored in your database, or with how text is represented internally by your programming language.
Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.
You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.
I recommend using command line tools like file to examine what character set a text file is stored in and iconv to convert text files from one character set to another.
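For PHP users, the check-then-convert step described above might look like the sketch below. It assumes the iconv extension is enabled and that the real source encoding is already known to be ISO-8859-1; the helper name ensure_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: return $text as UTF-8, converting from
// $assumedSource only when the bytes are not already valid UTF-8.
function ensure_utf8(string $text, string $assumedSource = 'ISO-8859-1'): string
{
    // preg_match() with the u modifier returns false on invalid UTF-8.
    if (preg_match('//u', $text) === 1) {
        return $text; // already valid UTF-8, leave untouched
    }
    // Assumption: the caller knows the real source encoding.
    $converted = iconv($assumedSource, 'UTF-8', $text);
    return $converted === false ? $text : $converted;
}

$latin1 = "caf\xE9";            // "café" stored as ISO-8859-1 bytes
$utf8   = ensure_utf8($latin1); // same text as UTF-8 bytes
echo bin2hex($utf8), "\n";      // 636166c3a9
```

Guessing the source encoding wrongly produces a different kind of garbage, so it is worth confirming it first with a tool like file.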
There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.
As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.
Is there any chance someone copied & pasted stuff from a Russian version of some software package?
Also, you can get & use iconv on Windows.

PHP: How to properly split a unicode Korean string?

I have a problem where I can't seem to be able to write "certain" Korean characters. Let me try to explain. These are the steps I take.
MS Access DB file (US version) has a table with Korean in it. I export this table as a text file with UTF-8 encoding. Let's call it "A.txt"
When A.txt is read, stored in an array, then written to a new file (B.txt), all characters display properly. I'm using header("Content-Type: text/plain; charset=UTF-8"); at the very beginning of my PHP script. I simply use fwrite($fh, $someStr).
When I read B.txt in another script and write to yet another new file (C.txt), one particular column breaks (obviously, in the PHP code I'm not working with a table or matrix, but that is effectively what it is once written back in the original text file format). Its characters show up something like this: ¸ì¹˜ ì–´ëœíŠ¸ 나ì¼ë¡. The entire column has broken characters, and if I have 5 columns in a text file, delimited by commas and enclosed in double quotes, this column breaks the Korean characters in all of the other columns too. If I omit this column when writing the text file, all is well.
Now, I noticed that certain PHP functions/operations break the Unicode characters. For example, if I use preg_replace() for any of the Korean strings and try to fwrite() that, it will break. However, I'm not performing anything that I'm not already doing on other fields/columns (speaking in terms of text file format), and other sections are not broken.
Does anyone have any idea on how to rectify this? I've tried utf8_encode() and mb_convert_encoding() in different ways with no success. I'm reading utf8_encode() wouldn't even be necessary if my file is UTF-8 to begin with. I've tried setting my computer language to Korean as well..
I've spent 2 days on this already, and it's becoming a huge waste of time. Please help!
UPDATE:
I think I may have found the culprit. In the script that creates B.txt, I split a long Korean string into two (using string ...<br /><br />... as indicator) and assign them to different columns. I think this splitting operation is ultimately causing the problem.
NEW QUESTION:
How do I go about splitting this long string into two while preserving the Unicode? Previously, I had used strpos() and substr(), but I am reading that the mb_*() functions might be what I need. Testing now.
Try the Unicode modifier (u) for the preg_* functions:
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
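One way to do the split from the question without cutting a character in half is to split on the literal delimiter with the u modifier. This is only a sketch, not the asker's actual code: the sample strings are made up, and the delimiter is assumed to be the literal <br /><br /> mentioned in the update.

```php
<?php
// Assumed delimiter, taken from the question's update.
$delimiter = '<br /><br />';
// Made-up sample text standing in for the real Korean column.
$text = '첫 번째 문장' . $delimiter . '두 번째 문장';

// preg_quote() escapes the delimiter; the u modifier makes PCRE
// treat both pattern and subject as UTF-8.
$pattern = '/' . preg_quote($delimiter, '/') . '/u';
$parts = preg_split($pattern, $text, 2);

// For contrast: cutting at an arbitrary byte offset splits a character.
$broken = substr($parts[0], 0, 1);

var_dump(preg_match('//u', $parts[0]) === 1); // true: still valid UTF-8
var_dump(preg_match('//u', $broken) === 1);   // false: half a character
```

Splitting on an ASCII delimiter keeps whole characters together; the damage usually comes from substr() with a byte offset that lands inside a multibyte character.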

Regex to deny special norwegian letters in friendly url - modx

I'm developing a page using MODX Revolution. It's a complete CMS with a lot of built-in functions. If I create a page in the manager, it automatically produces a friendly URL pointing to that page.
The problem is that it does not strip the special characters we have in Norway, æøå (and uppercase ÆØÅ).
The system has a built-in regex pattern that strips most bad characters from the URL, but I need the expression to strip æøå and ÆØÅ too.
The pattern looks like this:
/[\0\x0B\t\n\r\f\a&=+%#<>"~:`#\?\[\]\{\}\|\^'\\]/
Can anyone use their magic regex knowledge to include these 6 letters? I am totally green at regex, and simply adding the letters in there did not seem to work.
PS: Please don't use the common "boo, don't use regex for this" here. The pattern is there for a reason, and I don't want to mess around with the core, since we will have to upgrade MODX sooner or later.
Try using Unicode. I don't know MODX, but since it's written in PHP, I hope it uses PHP's preg regular expressions.
/[\0\x0B\t\n\r\f\a&=+%#<>"~:`#\?\[\]\{\}\|\^'\\\x{00C6}\x{00E6}\x{00C5}\x{00E5}\x{00D8}\x{00F8}]/u
The u modifier tells PHP to use Unicode matching mode, so it interprets the regular expression as a Unicode (UTF-8) string.
\x{00C6} is the Unicode character Æ
Please check the codes of the other characters yourself to make sure I didn't make a mistake while looking them up.
See regular-expressions.info for Unicode usage in PHP,
and Unicode.org for the code points.
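A quick way to check the extended pattern outside MODX is to run it through preg_replace(). This sketch assumes the pattern is used with PHP's preg functions and that the input is UTF-8; the sample alias text is made up.

```php
<?php
// Pattern from the answer, kept verbatim in a nowdoc to avoid quoting issues.
$pattern = <<<'REGEX'
/[\0\x0B\t\n\r\f\a&=+%#<>"~:`#\?\[\]\{\}\|\^'\\\x{00C6}\x{00E6}\x{00C5}\x{00E5}\x{00D8}\x{00F8}]/u
REGEX;

// Made-up sample input, written with escapes so it is UTF-8 regardless
// of how this file is saved: "Ærlig på øya".
$alias = "\u{00C6}rlig p\u{00E5} \u{00F8}ya";

// Remove every character matched by the class.
$stripped = preg_replace($pattern, '', $alias);
echo $stripped, "\n"; // rlig p ya
```

Without the u modifier, the letters would be compared byte by byte and the two-byte UTF-8 sequences for æøå would not match as single characters.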
MODX actually has a system setting where you can define a custom transliteration class: http://rtfm.modx.com/display/revolution20/friendly_alias_translit_class
However the docs are a bit sparse on how you might implement this. There is an existing package built by one of the core developers which supports alias transliteration for German and Russian, but you can easily add Norwegian or any other language to its configuration:
http://modx.com/extras/package/translit

special characters in file/folder names on Linux; Rename php function not working

I am using PHP's rename() function to move some images from one folder to another.
The destination folder has special characters in its name.
However, when I do this on the server I get an error saying that the folder with that name can't be found. In that error, the folder name's special characters are replaced with squares:
Warning: rename(../temp_images/668635375_1.jpg,../ad_images/B�tar/thumbs/668635375_1.jpg)
[function.rename]: No such file or directory in /var/www/etc....
It works on my local machine though (windows xp).
Any ideas?
Troubleshooting tips?
Thanks
I assume this is an encoding problem at some point.
However, using non-ASCII characters in file names is a slippery slope anyway.
I always recommend (since another SO user made me aware of that great and simple idea) that if you can, urlencode() file names and urldecode() them when serving them to the public. This will give you a file name consisting of characters that work on every file system known to me, and can hold any Unicode character.
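The urlencode() idea can be sketched as a round trip. The folder name below is made up to resemble the one in the warning, and the variable names are hypothetical.

```php
<?php
// Made-up name resembling the folder in the question ("Båtar"),
// written with byte escapes so it is UTF-8 however this file is saved.
$displayName = "B\xC3\xA5tar";

// Name actually used on disk: plain ASCII.
$onDisk = urlencode($displayName);
echo $onDisk, "\n"; // B%C3%A5tar

// Decode only when showing the name to users.
$shown = urldecode($onDisk);
var_dump($shown === $displayName); // true
```

Note that urlencode() turns spaces into + signs; rawurlencode() uses %20 instead, which may be preferable for names on disk.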
It is likely an encoding problem. It could be in the source itself (in which encoding are those "special" characters written in the PHP source?), somewhere else, or both. By "somewhere else" I mean that the string could have the right encoding but be parsed badly by PHP, or be parsed correctly but passed wrongly from rename() to the underlying system call (and filesystem) that performs the actual renaming. In my experience, bad things are likely to happen if you use "special" characters for folders or files that are read by different systems or accessed through different APIs. So: do not use "special" characters in folders or files that must be accessible by an HTTP server or PHP script on a machine that could differ from the one that created the folder or file.
Reading this could help.

PHP file-handling; Special characters in folder names

I am using rename() to move a file from one folder to another with php.
It works fine with folders whose names don't contain the Swedish characters å, ä and ö.
Is there any way around this? (except for changing the folder names to something without special chars)
The website is entirely in utf-8 format...
This seems to be a bit of a grey area, judging by the manual chapter on rename() and the User Contributed Notes. There is no word on what encoding should be used. Anyway, if the filesystem supports it, it should be possible to use UTF-8 in file names.
This SO question has a very clever answer to work around this. It's not 100% pure-bred, but probably workable in most cases.
If the characters you are using are also available in iso-8859-1, you could also try a simple utf8_decode(). But that solution is not complete and not perfect, as it will fail on characters outside the map.
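Since utf8_decode() silently replaces characters it cannot map (and is deprecated as of PHP 8.2), a more defensive variant is to convert with iconv() and verify the round trip. This is a sketch assuming the iconv extension and a filesystem that really expects ISO-8859-1 names; to_latin1_path is a hypothetical helper.

```php
<?php
// Hypothetical helper: convert a UTF-8 path to ISO-8859-1, or return
// null when some character has no ISO-8859-1 equivalent.
function to_latin1_path(string $utf8Path): ?string
{
    // @ hides the notice some iconv builds raise on unmappable input.
    $latin1 = @iconv('UTF-8', 'ISO-8859-1', $utf8Path);
    if ($latin1 === false) {
        return null;
    }
    // Round-trip check catches builds that substitute a placeholder instead.
    $back = @iconv('ISO-8859-1', 'UTF-8', $latin1);
    return $back === $utf8Path ? $latin1 : null;
}

// "Båtar" maps cleanly; a Korean syllable does not.
var_dump(to_latin1_path("ad_images/B\xC3\xA5tar") === "ad_images/B\xE5tar");
var_dump(to_latin1_path("ad_images/\xEB\x82\x98") === null);
```

A null result is the signal to fall back to another strategy, such as encoding or transliterating the name.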
Use the unicode normalize functions to normalize the filepath?
filePath = unicodedata.normalize('NFD', filePath);
This seems to be a bug, and I am not sure whether it has been fixed. You can use a regular expression to clean file and folder names, though. Or, as pointed out by TheGrandWazoo, you can use the Normalizer class.
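The normalize snippet above is Python; in PHP the counterpart is the Normalizer class from the intl extension. This sketch assumes intl may or may not be installed, so it falls back to returning the path unchanged. normalize_path is a hypothetical helper, and whether NFC or NFD is the right form depends on the filesystem.

```php
<?php
// Hypothetical helper: normalize a UTF-8 path to the given form when
// the intl extension is available; otherwise return it unchanged.
function normalize_path(string $path, bool $decomposed = false): string
{
    if (!class_exists('Normalizer')) {
        return $path; // intl not installed, nothing to do here
    }
    $form = $decomposed ? Normalizer::FORM_D : Normalizer::FORM_C;
    $result = Normalizer::normalize($path, $form);
    return $result === false ? $path : $result;
}

// "å" as one code point (NFC) versus "a" plus a combining ring (NFD).
$nfc = "B\u{00E5}tar";
$nfd = "Ba\u{030A}tar";

// The two spellings look identical but differ byte for byte.
var_dump($nfc === $nfd); // false
// With intl available, normalizing makes them compare equal.
var_dump(normalize_path($nfd) === $nfc || !class_exists('Normalizer'));
```

Two names that render identically can still differ in bytes, which is enough for rename() to report that the folder does not exist.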
