UTF-8 filenames and Greek characters - PHP

I'm trying to figure this out but I'm quite puzzled at the moment.
I have a directory on my website containing PDF files with Greek filenames (e.g. ΤΙΜΟΚΑΤΑΛΟΓΟΣ.pdf).
I want to have links to the files on a web page so that users can open or save them.
So far I can list the files fine, but if I click on one I get a 404 error. It's as if the server thinks the files aren't there, although they are.
I understand it's probably an encoding issue, but beyond that I'm not sure what to look for. The website encoding is UTF-8, and in order to display the filenames correctly I had to use mb_convert_encoding($file->filename, 'utf8', 'iso-8859-7').
This is the url: http://www.med4u.gr/timokatalogoi/
This is the directory listing: http://www.med4u.gr/pricelists/
The site is based on Joomla and it's hosted on a linux server.
Any ideas?

ISO-8859-* MUST DIE! (That's not personal!) Do everything in UTF-8. Everything. With good reason, some of us get upset when we see them being used, especially Latin-1 (8859-1) which bites a lot of people. I think you would find it very helpful to just dump them and move on to UTF-8.
Things to check:
Store your files encoded in UTF-8: Usually no difficulties with that.
Make sure your server is sending the files with UTF-8 charset: add header('Content-Type: text/html;charset=UTF-8'); near the top of your PHP.
Just in case someone saves your page, it's helpful in that case to put the same thing in a <meta> tag in the head.
Check it all in your browser: right click, view page info, and make sure the encoding is right.
cPanel is very flexible, so all of that is doable without much fuss. Feel free to comment if you want more detail.
If you have a database, there are a few more hoops to jump through, but it's worth it. With UTF-8 you never have to worry, and it's the definitive, future-proof way of doing things.

Let's suppose for the sake of argument that the file name on disk is aa.pdf but your conversion displays it as ab.pdf. You need either to revert the conversion so it points back to aa.pdf, or teach the server to remap or redirect requests for ab.pdf to this file. Or if you prefer, rename the file to ab.pdf instead, if your file system can handle this name.

It's definitely an encoding problem. You'll need to escape the URL, or convert it to whatever character set your server recognises.
For example, the bytes that spell 'ΤΙΜΟΚΑΤΑΛΟΓΟΣ LASER.pdf' in ISO-8859-7 read as 'ÔÉÌÏÊÁÔÁËÏÃÏÓ LASER.pdf' when interpreted as ISO-8859-1.
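A minimal sketch of the "escape the URL" suggestion, assuming the names on disk really are ISO-8859-7 bytes (as the mb_convert_encoding call in the question implies) and that the mbstring extension is available. The helper name pdf_link and the /pricelists/ base path are illustrative, not taken from the site's code. The idea is to keep the raw on-disk bytes for the href and convert only the visible label:

```php
<?php
// Hypothetical helper: $rawName is the file name exactly as returned by the
// directory listing (assumed here to be ISO-8859-7 bytes).
function pdf_link(string $rawName, string $basePath = '/pricelists/'): string
{
    // href: percent-encode the untouched on-disk bytes so the server can find the file.
    $href = $basePath . rawurlencode($rawName);

    // Label: convert to UTF-8 only for display, matching the page encoding.
    $label = mb_convert_encoding($rawName, 'UTF-8', 'ISO-8859-7');

    return '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">'
        . htmlspecialchars($label, ENT_QUOTES, 'UTF-8') . '</a>';
}

// "\xD4\xC9" is "ΤΙ" in ISO-8859-7.
echo pdf_link("\xD4\xC9.pdf"), "\n";
// <a href="/pricelists/%D4%C9.pdf">ΤΙ.pdf</a>
```

If the files were instead renamed on disk to UTF-8 names, the conversion step would disappear and rawurlencode alone should be enough.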

Related

Is it safe to use raw emojis in PHP source code?

Example:
$fire = '🔥';
I know PHP 5+ supports this natively, but is it best practice, or should I store them using their code points instead? If so, why?
As far as your editor and the PHP compiler are concerned, it's all just text, and '🔥' is no different from 'fire' or 'Φωτιά'.
When PHP runs, it will read the bytes in from the file and put them in memory, without caring what they mean. This leads to the most likely problem you'll have: if you save the file in your text editor as UTF-16, and then echo the string to a browser telling it that it's UTF-8, the browser won't show the right thing. But that's easily avoided by making sure your editor always uses UTF-8, and your output headers tell the browser that's what you're using.
If you don't trust your editor to do that, and you're running PHP 7 or later, you could write it in the escaped notation "\u{1f525}" (which only works inside double quotes), but when it runs, the same bytes will end up in memory.
You might have similar problems if you send the text elsewhere - to a database, for instance - and that somewhere else doesn't know to handle it as UTF-8. How you write the string in your source file won't make any difference to that, though, that's just a case of making sure everything is configured to match.
Note: you don't actually have to use UTF-8 for this, you could use UTF-16, or some other encoding, as long as you're consistent; but UTF-8 is by far the most common these days, particularly on the web.
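To illustrate the point about identical bytes, here is a small sketch. It assumes the source file itself is saved as UTF-8; the variable names are arbitrary.

```php
<?php
// The "\u{...}" escape needs PHP 7.0+ and only works in double-quoted strings.
$literal = '🔥';           // raw character; relies on the file being saved as UTF-8
$escaped = "\u{1f525}";    // code point escape; independent of the editor's encoding

var_dump($literal === $escaped);   // bool(true) when the file is UTF-8
echo bin2hex($escaped), "\n";      // f09f94a5
echo strlen($escaped), "\n";       // 4 (strlen counts bytes, not characters)
```

Either spelling gives PHP the same four bytes, so the choice is mostly about how much you trust every editor that will touch the file.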

How to remove question mark garbage data, dynamically, from files?

I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.
It'll usually look something like this in an HTML file in a browser:
this is the beginning of the file, ��
In the file itself, it appears like this:
this is the beginning of the file, xE2xA0
I tried using a regex editor to remove it, but to no avail: it cannot find the sequence at all. How can I remove this garbage data? Again, some of the files contain all kinds of HTML markup.
Thank you for any help.
Those appear because something is wrong with a character set on your site.
For example, your files may be stored as UTF-8 while your Content-Type is set to text/html; charset=ISO-8859-1. The problem could also lie in how text is stored in your database, or in how text is represented internally by your programming language.
Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.
You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.
I recommend using command-line tools such as file (to examine which character set a text file is stored in) and iconv (to convert text files from one character set to another).
There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.
As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.
Is there any chance someone copied & pasted stuff from a Russian version of some software package?
Also, you can get & use iconv on Windows.
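The conversion suggested above can also be done from PHP, which has its own iconv() function. The following is only a sketch: the helper name to_utf8 is made up, and the source encoding passed in is an assumption that has to be established per file, since it cannot be detected reliably from the bytes alone.

```php
<?php
// Hypothetical helper: returns $bytes as UTF-8, converting from $assumedSource
// only when the input is not already valid UTF-8.
function to_utf8(string $bytes, string $assumedSource): string
{
    // With the /u flag, preg_match() returns false on invalid UTF-8 input.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }
    $converted = iconv($assumedSource, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $assumedSource");
    }
    return $converted;
}

// 0xE2 0xA0 on its own is a truncated UTF-8 sequence, so it gets converted.
// Read as ISO-8859-5 (the guess above) it is a Cyrillic letter plus NBSP.
echo bin2hex(to_utf8("\xE2\xA0", 'ISO-8859-5')), "\n";  // d182c2a0
```

Input that is already valid UTF-8, such as a genuine Braille pattern character, passes through unchanged.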

UTF-8 doesn't get handled correctly by PHP: server-related, not file-related?

This is so weird; it has never happened to me before. I've worked with UTF-8 a lot and this is the first time I've seen it.
Since last week, all my sites that have UTF-8 characters in files are showing ? instead of the actual character!
The files are OK and I can see the characters fine if I edit them, but after a file gets processed by PHP the UTF-8 characters are replaced with ?.
The UTF-8 characters stored in the database load just fine; the problem is with the strings that are in PHP files.
Notice I said "since last week": it happened all of a sudden, so obviously something changed on the server.
I contacted my hosting company, but they have no clue what to look for and I don't know what to tell them to look for.
Any clue what could have been changed on the server?
So, to conclude:
it's not a database problem
it's not a file encoding problem (I hope not; I have 30+ sites with a different CMS on each one and cannot afford to edit them all)
it's not a content-type issue in the HTML, because the text is processed by PHP and that is where the UTF-8 characters turn into ?
it could also be a WordPress problem, but I'm sure this happened after some changes on the server side
it's not a database problem - Check
it's not a file encoding problem - THIS actually could be it
it's not a content-type issue - Check (but make sure you write utf-8 in lowercase in the meta tag!)
WordPress problem - Maybe, in combination with file encoding
I can imagine a situation where the mbstring extension for PHP was removed or disabled and you then edited your template through WordPress. Your characters would have been mangled at that point.

Why is PHP's utf8_encode breaking my utf-8 string?

I'm doing a kind of roundabout experiment thing where I'm pulling data from tables in a remote page to turn it into an ICS so that I can find out when this sports team is playing (because I can't find anywhere that the information is more readily available than in this table), but that's just to give you some context.
I pull this data using cURL and parse it using DOMDocument. Then I take it and parse it for the info I need. What's giving me trouble is the opposing team. When I display the data on the initial PHP page, it's correct. But when I write to an ICS file, special UTF-8 characters get messed up. I thought utf8_encode would solve that problem, but it actually seems to have the opposite effect: when I run the function on my data, even the text displayed on the page (which had been displaying correctly) becomes incorrect, not just the separate ICS file (which was already being written incorrectly). As an example: it turns "Inđija" into "InÄija."
Any tips or resources as far as dealing with UTF-8 strings in PHP? My server (a remote host) doesn't have mbstring installed either, which is a pain.
utf8_encode encodes a string in ISO 8859-1 as UTF-8. If you put UTF-8 into it, it's going to interpret it as if it was ISO 8859-1, and hence produce mojibake.
To help with your original problem (the one you had before adding utf8_encode), I'd want to know which "special" characters are being messed up, and in what way.
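The mojibake described above can be reproduced in a few lines. This sketch uses iconv() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode(), which performs the same conversion but is deprecated as of PHP 8.2; the variable names are arbitrary.

```php
<?php
$original = "In\u{111}ija";                           // "Inđija", already UTF-8
$mangled  = iconv('ISO-8859-1', 'UTF-8', $original);  // wrongly treated as Latin-1

echo bin2hex($original), "\n";  // 496ec491696a61
echo bin2hex($mangled), "\n";   // 496ec384c291696a61 (Ä plus an invisible control character)

// The damage is reversible as long as no bytes were lost along the way.
$restored = iconv('UTF-8', 'ISO-8859-1', $mangled);
var_dump($restored === $original);  // bool(true)
```

The two bytes of "đ" (c4 91) are each re-encoded as a separate character, which is why the result shows "Ä" where "đ" used to be.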

Getting funny squares in browser when displaying content

I have content stored in a Postgres DB. Every time I fetch the content and display it using PHP, I get funny squares in IE and funny square-type question marks in Firefox.
Example below
* - March � May 2009
How do I remove this?
I do not have access to the server so can't adjust the encoding there, only have postgres DB details and FTP access to upload my files
I would also recommend: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, I've read it only recently myself, it will definitely help you sort out your problems.
You need to make sure that Postgres, PHP, and your browser all agree on the content encoding, and that you have an appropriate font selected in your browser. The simplest way to do that is to choose UTF8 for everything.
I don't know about PHP, but I do know about databases and browsers. First you need to find out if the database is UTF8. (From psql, I would do a "\l" and look at the encoding.) Then you need to find out if PHP supports UTF8 (I have no idea how you do that). Then you need to see how those characters are being stored in the database by the PHP app. Then you need to figure out if the web server is correctly reporting the content encoding. (On Linux/Unix, I'd use the program "HEAD" (not "head") to see the headers it's returning.) And then you need to figure out if your browser is using a font that covers the characters in question.
Or, you could just make sure you only store ASCII and forget the rest of the world exists. Not recommended.
There is a wrong charset somewhere. The characters could already be stored incorrectly in the database, or you have the wrong charset in the meta tags on the page (try changing the charset manually in the browser), or the wrong encoding is being used when the page communicates with the database.
Check this page http://www.postgresql.org/docs/8.2/static/multibyte.html for more information.
Try to have same encoding on all places, preferably UTF-8
You have encoding issues. Make sure the encoding is set right in the database, in the html markup and make sure the files themselves are saved in proper encoding.
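One way to find out where the chain breaks is to look at the raw bytes PHP receives from the database before they reach the browser. The sketch below is an illustration only: the helper name inspect_bytes is made up, and the sample string assumes the "�" in the question stands for a single Windows-1252 en dash byte (0x96), which the question does not confirm.

```php
<?php
// Hypothetical diagnostic helper: reports whether a string is valid UTF-8
// and shows its bytes in hex.
function inspect_bytes(string $value): string
{
    // With the /u flag, preg_match() returns false on invalid UTF-8 input.
    $valid = preg_match('//u', $value) === 1;
    return ($valid ? 'valid UTF-8' : 'NOT valid UTF-8') . ': ' . bin2hex($value);
}

// Assumed sample: an en dash stored as the single byte 0x96 (Windows-1252).
echo inspect_bytes("\x96"), "\n";      // NOT valid UTF-8: 96
// The same character as real UTF-8 (U+2013) is three bytes.
echo inspect_bytes("\u{2013}"), "\n";  // valid UTF-8: e28093
```

If the bytes coming back from the database are not valid UTF-8 while the page is served as UTF-8, the connection's client encoding is a likely place to look; with PHP's pgsql extension, pg_client_encoding() reports it and pg_set_client_encoding() changes it.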
