php5 encoding : I don't detect turkish characters - php

I have a PHP script which detects keyword density on a given URL.
My problem is that it either doesn't detect Turkish characters or deletes them.
I'm getting the contents of the URL with file_get_contents. That call works perfectly and returns all the content, including the Turkish characters.
You can see my code here or try script here.

You seem to be fetching and converting the file_get_contents data as UTF-8 (probably correctly), but your HTML page is not specifying an encoding for itself. So probably, any incoming form data is in iso-8859-1. Try specifying utf-8 as your page's encoding as well:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
And the obligatory reading link on encoding basics: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
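Once the page and the fetched content are both UTF-8, the counting code also has to treat the text as UTF-8, or multibyte letters get split or dropped. The following is only a minimal sketch of that idea, not the asker's actual script: the function name keywordCounts and the sample sentence are made up for illustration, and it assumes PHP 7+ with the mbstring extension and a source file saved as UTF-8.

```php
<?php
// Hypothetical helper: counts how often each word occurs in a UTF-8 string.
function keywordCounts(string $text): array
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8, so \p{L}
    // matches letters such as ş, ğ, ı and ç instead of splitting words on them.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return []; // preg_split fails when the input is not valid UTF-8
    }

    $counts = [];
    foreach ($words as $word) {
        // mb_strtolower is multibyte-aware; plain strtolower is not.
        $key = mb_strtolower($word, 'UTF-8');
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    arsort($counts); // most frequent words first
    return $counts;
}

$counts = keywordCounts('çay şeker çay, ışık; çay şeker');
print_r($counts);
```

Without the /u modifier, a pattern such as \w matches only ASCII word characters by default, which is one common way for Turkish letters to disappear from a word list.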

Related

Incorrect rendering of Language (e.g. Arabic)

I apologize if this question is not directly related to programming. I'm having an issue, of which I have two examples;
I have a website, where I store Arabic words in a DB, and then retrieve it, and display it on a page, using php. (Here's the link to my page, that is displaying Arabic incorrectly.)
I visit any random website, where the majority of the content is supposed to be in Arabic. (An example of a random website that gives me this issue.)
In both these cases, the Arabic text is displayed as 'ÇáÔíÎ: ÇáÓáÝ ãÚäÇå ÇáãÊÞÏãæä Ýßá'... or such weird characters. Do note that, in the first case, I may be able to correct it, since I control the content. So, I can set the encoding.
But what about the second case [this is where I want to apologize, since it isn't directly related to programming (the code) from my end] - what do I do for random websites I visit, where the text (Arabic) is displayed incorrectly? Any help would really be appreciated.
For the second case:
This website is encoded with Windows-1256 (Arabic), however, it wrongly declares to be encoded with ISO 8859-1 (Latin/Western European). If you look at the source, you can see that it declares <meta ... charset=ISO-8859-1" /> in its header.
So, what happens is that the server sends to your browser an HTML file that is encoded with Windows-1256, but your browser decodes this file with ISO 8859-1 (because that's what the file claims to be).
For the ASCII characters, this is no problem as they are encoded identically in both encodings. However, not so for the Arabic characters: each code byte corresponding to an Arabic character (as encoded by Windows-1256) maps to some Latin character of the ISO 8859-1 encoding, and these garbled Latin characters are what you see in place of the Arabic text.
If you want to display all the text of this website correctly, you can manually set the character encoding that your browser uses to decode this website.
You can do this, for example, with Chrome by installing the Set Character Encoding extension, and then right-click on the website and select:
Set Character Encoding > Arabic (Windows-1256)
In Safari, you can do it simply by selecting:
View > Text Encoding > Arabic (Windows).
The same should be possible with other browsers, such as Firefox or Internet Explorer, too...
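The byte-level explanation above can be checked with a short sketch. It takes the first garbled word quoted in the question, turns it back into the bytes the server sent, and decodes those bytes as Windows-1256 instead. It assumes the source file is saved as UTF-8 and that the system's iconv library knows the name WINDOWS-1256 (glibc and GNU libiconv do; other builds may not).

```php
<?php
// Garbled text as quoted in the question (this file is assumed to be UTF-8).
$garbled = 'ÇáÔíÎ';

// Step 1: recover the original bytes by encoding the Latin characters
// back to ISO 8859-1, which is how the browser decoded them.
$bytes = iconv('UTF-8', 'ISO-8859-1', $garbled);

// Step 2: decode the same bytes as Windows-1256, the encoding the
// page is really written in, and convert to UTF-8 for display.
$arabic = iconv('WINDOWS-1256', 'UTF-8', $bytes);

echo bin2hex($bytes), "\n"; // c7e1d4edce
echo $arabic, "\n";         // the Arabic word, now readable
```

This is the same reinterpretation the browser performs when you switch its encoding by hand; no information was lost, the bytes were only read with the wrong table.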
For the first case:
Your website (the HTML file that your server sends to the browser) is encoded with UTF-8. However, this HTML file doesn't contain any encoding declaration, so the browser doesn't know with which encoding this file has been encoded.
In this case, the browser is likely to use a default encoding to decode the file, which typically is ISO 8859-1/Windows-1252 (Latin/Western European). The result is the same as in the above case: all the Arabic characters are decoded to garbled Latin characters.
To solve this problem, you have to declare that your HTML file is encoded with UTF-8 by adding the following tag in the header of your file:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

Display chinese characters WITHOUT using utf8 encoding?

I'm fetching rows from a MySQL database with a unicode_general_ci collation. The columns contain Chinese characters such as 格拉巴酒和蒸馏物 and I need to display those characters.
I know that I should work in utf-8 encoding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
but I can't: I'm working on a legacy application where most of the .php files are saved as ANSI and the whole site is using:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
Is there any way to display them?
Bonus question: I've tried to manually change the encoding in Chrome (Tools -> Encoding -> UTF-8) and it doesn't seem to work: the page is reloaded but ???? is displayed instead of the Chinese characters.
You can display 格 using the numeric entity reference &#x683C;, etc. The encoding of the page should not matter in this case; HTML entity references always refer to Unicode code points.
PHP has a function htmlentities for this purpose, but it appears that you will need workarounds for handling numeric entities. This json_encode hack is fairly obscure, but is probably programmatically the simplest.
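// json_encode() escapes non-ASCII characters as \uXXXX; strip the surrounding
// quotes, then turn each escape into a hexadecimal numeric entity.
echo preg_replace('/\\\\u([0-9a-f]{4})/', '&#x$1;',
    preg_replace('/^"(.*)"$/', '$1', json_encode($s)));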
This leverages the fact that json_encode will coincidentally do the conversion for you; the rest is all mechanics. (I guess that's PHP for you.)
IDEone demo
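As a possible alternative to the json_encode approach, the mbstring extension has a function meant for this conversion, mb_encode_numericentity. The sketch below is not part of the original answer. It assumes mbstring is loaded and that the string coming from MySQL really is UTF-8 (which depends on the connection charset); the wrapper name toNumericEntities is made up for illustration.

```php
<?php
// Hypothetical wrapper: turns every non-ASCII character of a UTF-8 string
// into a decimal numeric entity and leaves ASCII untouched.
function toNumericEntities(string $utf8): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

echo toNumericEntities('Grappa: 格拉'), "\n"; // Grappa: &#26684;&#25289;
```

The output is pure ASCII, so it survives being embedded in a page that is declared as iso-8859-1.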
Your "bonus question" isn't really a question, but of course, that's how it works; raw bytes in the range 128-255 are only rarely valid UTF-8 sequences, so unless what you have on the page is valid UTF-8, you are likely to get the "invalid character" replacement glyph for those bytes.
For the record, the first two Chinese Han glyphs in your text in UTF-8 would display as æ ¼æ‹‰ if mistakenly displayed in Windows code page 1252 (what you, and oftentimes Microsoft, carelessly refer to as "ANSI") -- if you have those bytes on the page then forcing the browser to display it in UTF-8 should actually work as a workaround as well.
For additional background I recommend @deceze's What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work With Text.
I'm not sure that you can. iso-8859-1 is commonly called "Latin 1". There's no support for any Asian kanji-type languages at all.
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa. It is also commonly used in most standard romanizations of East-Asian languages.

Advice Build Web Sites using ISO-8859-1 or UTF-8

Hi everybody, I have a decision to make about building a website in Spanish. The database has a lot of accents and special characters, for example ñ, and when I show the data in the view it appears like "InformÃ¡tica, ProducciÃ³n, OrganizaciÃ³n, DiseÃ±ador Web, MÃ©todos" etc. I am using JSP & Servlets, MySQL, and phpMyAdmin under Fedora 20, and right now I have added this to the HTML file:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
and in Apache I changed the default charset:
#AddDefaultCharset UTF-8
AddDefaultCharset ISO-8859-1
but in the browser the data still appears like this: "InformÃ¡tica, ProducciÃ³n, Analista de OrganizaciÃ³n y MÃ©todos". I don't know what to do. I have been searching all day about whether to build the website using UTF-8, but I don't want to convert all accents and special characters all the time. Any advice, guys?
The encoding errors appearing in your text (e.g., Ã¡ instead of á) indicate that your application is trying to output UTF-8 text, but your pages are incorrectly specifying the text encoding as ISO-8859-1.
Specify the UTF-8 encoding in your Content-Type headers. Do not use ISO-8859-1.
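The symptom is easy to reproduce. The sketch below is written in PHP rather than the asker's JSP, purely as an illustration; it assumes the mbstring extension and a source file saved as UTF-8. It reads the UTF-8 bytes of one of the words from the question as if they were ISO-8859-1, which is what the browser does when the page is mislabeled.

```php
<?php
$original = 'Informática'; // UTF-8 bytes (source file assumed to be UTF-8)

// Simulate the browser's mistake: treat the UTF-8 bytes as ISO-8859-1.
// The two bytes of "á" (C3 A1) become two separate characters.
$misread = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n";  // InformÃ¡tica

// The mistake is reversible as long as the misread text was not saved
// or filtered in between: map each character back to its single byte.
$restored = mb_convert_encoding($misread, 'ISO-8859-1', 'UTF-8');
echo $restored, "\n"; // Informática
```

Since the data itself is intact, declaring UTF-8 consistently fixes the display without converting any accents by hand.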
It depends on several things:
The editor in which each file was written, and whether it works in UTF-8 or ISO-8859-1 by default. If the original file was written in ISO-8859-1 and you open it as UTF-8, the special characters appear wrong. If you then save the file in that state, you corrupt the original encoding (the wrong characters are saved as UTF-8).
The Apache configuration.
Whether there is a hidden .htaccess file in the root directory that serves the website (httpdocs, public_html or similar).
What is specified in the META tags of the resulting HTML.
What is specified in the header sent by a PHP file.
The charset chosen for the database (if you use a database to display content with a CMS such as Joomla, Drupal, phpNuke, or your own dynamic application).

utf8 not supporting for other languages why?

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
It is not working. My site can be translated into 20 languages, but in some languages, like Turkish and Japanese, it shows the � symbol instead of a space, a " and many other characters.
Since I don't know your site I can just guess in the dark.
Setting
<meta charset="utf-8" />
will not be the only thing you have to do. If your document is saved as ASCII your problems won't be solved. Additionally you have to set the document encoding correctly (the meta tag just tells the browser which encoding to use, not which one IS actually used). So open the document with a (good) text editor like SublimeText / Notepad++ or what you prefer and set the encoding to UTF-8.
For PHP, you need to send a UTF-8 header:
header ('Content-type: text/html; charset=utf-8');
Telling the browser that the text is Unicode and actually providing the data in Unicode are not the same thing. Check that your files are Unicode, that the database data is Unicode, and what transformations are applied to it while serving. Provide more information so the problem can be pinpointed.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Just adding a Content-Type header in HTML doesn't make anything utf-8. It merely tells the browser to expect utf-8. If your source files are not in utf-8, or the database columns in which data is stored isn't utf-8, or if the connection to the database itself isn't utf-8, or if you're sending a HTTP header telling it isn't utf-8, it will not work. There's just one way of dealing with utf-8: make sure everything is in utf-8.
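One way to find which link in that chain is not UTF-8 is to validate strings where they enter your code (file contents, database rows, form input). The sketch below assumes the mbstring extension. The helper name ensureUtf8 and the Windows-1252 fallback are assumptions made for illustration; guessing a legacy encoding is a stopgap, not a substitute for fixing the source.

```php
<?php
// Hypothetical helper: returns the value unchanged if it is valid UTF-8,
// otherwise converts it from an assumed legacy encoding.
function ensureUtf8(string $value, string $assumedLegacy = 'Windows-1252'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', $assumedLegacy);
}

// "café" as a Latin-1/Windows-1252 editor would save it: é is the single byte E9.
$legacy = "caf\xE9";

var_dump(mb_check_encoding($legacy, 'UTF-8')); // bool(false)
echo ensureUtf8($legacy), "\n";                // café
```

A lone byte such as E9 is not a valid UTF-8 sequence, which is why a page declared as UTF-8 shows � for it.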
The problem is caused by the admin tool you are using. The tool injects data in some other encoding into UTF-8 encoded data. As the tool has not been described, the specific causes cannot be isolated. The pages mentioned do not exhibit the problem, and they specify the UTF-8 encoding in HTTP headers, so the meta tag is ignored in the online context (though useful for offline use).

Strange characters appearing after copy/pasting in forms/emails

I have users who sometimes paste things into my site's forms after copying something from their Gmail. The characters look normal when they paste it, but in the database they have extra special characters that appear.
Here is an example of the text with the special characters.
It originally happened on this page:
http://www.hikingsanfrancisco.com/hiker_community/scheduled_hike_event.php?hike_event_id=91
But it looks like the person who made it has cleaned up the strange characters.
Does anyone know how to stop this from happening in the future?
Thanks,
Alex
I use PHP and MySQL
I'd guess that you're getting UTF-8 encoded text but your database is configured for ISO-8859-1 (AKA Latin-1). The page you reference says:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
so it is claiming to be encoded as UTF-8. A form on a UTF-8 page will be sent back to the server in UTF-8. Then you send that UTF-8 data into your database where it is stored as Latin-1 encoded text. If you're not handling the UTF-8 to Latin-1 change yourself then you'll get "funny" characters when you send the data back to a browser. As long as the text only uses standard ASCII characters then everything will be okay as UTF-8 and Latin-1 overlap on the ASCII characters.
The solution is to pick a character encoding and use it everywhere. I'd recommend UTF-8 everywhere. However, if your database is already in Latin-1 then you'll have to go with Latin-1 or change the encoding in the database and re-encode all the data. But, if all the text in your database is simple ASCII then no re-encoding will be needed.
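On the PHP side, "UTF-8 everywhere" includes the connection to MySQL, not only the tables and the page. The sketch below shows one way to request that with PDO. It is an illustration only: the helper name utf8Dsn, the host and the database name are made up, and the actual connection is left commented out because it needs real credentials. The charset parameter in the DSN is honored by the pdo_mysql driver, and utf8mb4 is MySQL's name for full UTF-8.

```php
<?php
// Hypothetical helper: builds a PDO DSN that asks MySQL to use UTF-8
// (utf8mb4) for everything sent over this connection.
function utf8Dsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

$dsn = utf8Dsn('localhost', 'hikes'); // placeholder host and database name
echo $dsn, "\n"; // mysql:host=localhost;dbname=hikes;charset=utf8mb4

// Connecting requires real credentials, so it is only shown as a comment:
// $pdo = new PDO($dsn, 'user', 'password', [
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ]);
```

With mysqli, the equivalent step is calling set_charset('utf8mb4') on the connection object right after connecting.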
Hard to say what's going on without examples, but a character encoding mismatch is the usual problem when funny (funny peculiar, not funny ha-ha) characters appear only when text is sent back to the browser.
