HTML source code doesn't understand Arabic text? - php

I'm trying to read the source code of a web page that contains Arabic text, but all I'm getting is this: &#1580;&#1575;&#1605;&#1593;&#1577; (which is not Arabic, only a group of characters).
If I reload the page on my localhost I get the Arabic tags and text correctly.
But I really need to read that source code. Any suggestions or lines of code I can add?
<html dir=rtl>
<META http-equiv=Content-Type content=text/html;charset=windows-1256>
These are a few lines from the page that show the encoding used. The page is written using HTML and PHP.

The characters are merely escaped to HTML entities. The browser decodes them to "real characters" when it renders the page. You can decode them yourself using html_entity_decode:
html_entity_decode('&#1580;&#1575;&#1605;&#1593;&#1577;', ENT_COMPAT, 'UTF-8')
Note the last parameter, which sets the encoding the characters will be decoded to. Use whatever encoding you're working with internally, I'm just suggesting UTF-8 here.
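As a minimal sketch of the decoding step described above, assuming you work internally in UTF-8: the entity string is the numeric-reference form of the Arabic word from the question, and the variable names are only illustrative.

```php
<?php
// Numeric character references as they appear in the raw page source.
$escaped = '&#1580;&#1575;&#1605;&#1593;&#1577;';

// Decode the references to real characters. The third argument is the
// encoding of the returned string (UTF-8 is an assumption here).
$decoded = html_entity_decode($escaped, ENT_COMPAT, 'UTF-8');

echo $decoded, "\n";
```

If your application uses a different internal encoding, pass that encoding as the third argument instead.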

Related

Japanese and Russian characters - web encoding?

I have a Zope/Plone web service (WS) that calls some functions written in Python.
The WS is called by PHP pages (with UTF-8 declared in the header), but the characters aren't displayed.
I've tried converting special chars to entities where possible (in Python), and that works, but not all chars have corresponding HTML entities.
I've tried saving the original Python file in UTF-8 format, but I didn't think that was the right way.
Can someone help?
Note: I pass through some PHP includes, if that is a hint...
Edit: it's weird, because if I log all the "pieces" individually, the characters are encoded correctly. When I go up to the "main PHP page" (where I include all the pieces), everything gets messed up.
Obviously, the "main php page" has that:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
496e73e972657220646174652064926172726976e9652065742064652064e970617274
That string is encoded in ISO-8859-1, not UTF-8.
Somewhere you're converting your strings to ISO-8859-1, which means they're not interpreted correctly when trying to interpret them as UTF-8, and all non-European characters will be discarded since ISO-8859-1 can't encode anything but a handful of European characters.
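A rough way to check this yourself is to turn the hex dump back into bytes and test whether they are valid UTF-8. The sketch below assumes the mbstring extension is available. It converts from Windows-1252 rather than strict ISO-8859-1, because the byte 0x92 in the dump is a typographic apostrophe in Windows-1252 and only a control code in ISO-8859-1; that choice is my assumption, not something stated in the thread.

```php
<?php
// Hex dump of the string as it arrives on the PHP side.
$hex = '496e73e972657220646174652064926172726976e9652065742064652064e970617274';
$bytes = hex2bin($hex);

// A lone 0xE9 byte followed by ASCII is not a valid UTF-8 sequence.
$isUtf8 = mb_check_encoding($bytes, 'UTF-8');

// Reinterpret the bytes as Windows-1252 and convert them to UTF-8.
$text = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

echo $isUtf8 ? "valid UTF-8\n" : "not valid UTF-8\n";
echo $text, "\n";
```

The converted text is a French sentence, which suggests the data was a single-byte Western encoding all along and was never UTF-8 at that point in the pipeline.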
I just edited Python's site.py file.
I followed that guide: click here and everything is OK now.
Thank you all for help.

HTML - Mixing UTF-8 coming from MySQL database and special chars into HTML

I have a database where everything is defined in UTF-8 (charsets, collations, ...).
I have a PHP page that gets data from that database and displays it.
That PHP page contains some hard-coded text with special characters, like é, à, ...
My PHP page has meta charset defined to utf-8.
I call mysql_set_charset("utf8");
My PHP page is written on an editor that is configured to encode to utf-8 Unicode (Dreamweaver CS4, there is no other utf-8 option)
Anything coming from the database is ok, but...
I can't display the hard-coded special characters (é, à, ù, ...) correctly.
I have the same problem when I use strip_tags(html_entity_decode($datafromdatabase)); on data coming from the database. Here it's really problematic.
What can I do to keep using UTF-8, but still display the special chars correctly without having to use their HTML equivalents (&eacute;, &agrave;, ...)?
EDIT
The problem with hard-coded characters was coming from the PHP page, which was not saved using the proper encoding. I created a new document, copied and pasted the old code into that new page, and saved it over the old page. No more problems with hard-coded characters.
But I still have problems with strip_tags(html_entity_decode($datafromdatabase));
Using $datafromdatabase = htmlentities(strip_tags(html_entity_decode($datafromdatabase)), ENT_COMPAT, "UTF-8") does not solve the problem. I get strange characters starting with # for each é, à, ù in the text coming from the database (stored as &eacute;, ...)
It looks like it's a problem with your browser properly displaying the characters rather than with saving them.
Check two things.
Issue a utf8 http header
header( 'Content-Type: text/html; charset=UTF-8' );
And make sure your html declaration is mentioning utf8
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
That's for html 4
If your document is properly encoded, this should do it.
The problem with hard-coded characters was coming from the PHP page, which was not saved using the proper encoding. I created a new document, copied and pasted the old code into that new page, and saved it over the old page. No more problems with hard-coded characters.
For the problem coming from strip_tags(html_entity_decode($datafromdatabase)); I in fact had to use strip_tags(html_entity_decode($datafromdatabase, ENT_QUOTES, "UTF-8"));
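A minimal sketch of that fix, using a made-up sample value in place of the real database content. Before PHP 5.4 the default target charset of html_entity_decode was ISO-8859-1, so leaving out the third argument produced bytes that are not valid UTF-8.

```php
<?php
// Hypothetical value as stored in the database: markup plus named entities.
$datafromdatabase = '<p>Caf&eacute; &agrave; Paris</p>';

// Decode entities to UTF-8 characters first, then remove the tags.
$clean = strip_tags(html_entity_decode($datafromdatabase, ENT_QUOTES, 'UTF-8'));

echo $clean, "\n";
```

Passing the charset explicitly keeps the behaviour the same across PHP versions and ini settings.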

Weird characters appear after I use php's mb_substr() on a string

I'm developing a web site with PHP (5.3.5, Ubuntu) and all the content is in Spanish. I would like to cut the text when it doesn't fit the space designated for it. I have the following meta tag in the php file where I want to do this: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />.
The text comes from a MySQL database where charset is latin1 and collation latin1_spanish_ci. I'm trying to cut the text with the mb_substr() function. But it isn't working correctly. For example, let's say I want to cut Short Psicodélico to Short Psicodéli, the function would be:
mb_substr('Short Psicodélico', 0, 15, 'ISO-8859-1');
But the result is something like this: Short Psicod&ea. The e with the diacritic is transformed into &ea and I don't know why. I think it has something to do with the character encoding but I don't know exactly how. If I don't use this function the characters appear as they should: instead of Short Psicod&ea it shows Short Psicodélico.
The text is encoded in the database as "Short Psicod&eacute;lico", so mb_substr() cuts through the middle of the entity. You will need to scrub your database to remove the encoding, as well as fix your input routines to make certain that text is not saved to the database encoded.
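The effect can be reproduced with a short sketch. It uses UTF-8 instead of the asker's ISO-8859-1 so that the example does not depend on how the file is saved; that substitution, and the variable names, are assumptions for illustration.

```php
<?php
// The value as it is actually stored: the accent is an HTML entity.
$stored = 'Short Psicod&eacute;lico';

// Cutting the raw value counts "&", "e", "a" as three separate characters,
// so the cut lands inside the entity.
$broken = mb_substr($stored, 0, 15, 'UTF-8');

// Decoding first turns the entity into one character, so the cut is clean.
$decoded = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');
$fixed = mb_substr($decoded, 0, 15, 'UTF-8');

echo $broken, "\n";
echo $fixed, "\n";
```

Decoding at read time is only a workaround; storing the text unescaped, as the answer recommends, removes the need for it.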

Unicode and PHP - am I doing something wrong?

I'm using Kohana 3, which has full support for Unicode.
I have this as the first child of my <head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The Unicode character I am inserting is é, as in Café.
However, I am getting the triangle with a ? (as in could not decode character).
As far as I can tell in my own code, I am not doing any string manipulation on the text.
In fact, I have placed the accent straight into a view's PHP file and it is still not working.
I copied the character from this page: http://www.fileformat.info/info/unicode/char/00e9/index.htm
I've only just started examining PHP's Unicode limitations, so I could be doing something horribly wrong.
So, how do I display this character? Do I need to resort to the HTML entity?
Update
So this works
Caf<?php echo html_entity_decode('&eacute;', ENT_NOQUOTES, 'UTF-8'); ?>
Why does that work? If I copy the output accented e from that script and insert it into my document, it doesn't work.
View the http headers. You should see something like
Content-Type: text/html; charset=UTF-8
Browsers ignore the meta tag when a real HTTP header states a different encoding.
update
Whatcha get from this?
echo bin2hex('é');
echo chr(0xc3) . chr(0xa9);
You should get c3a9é, otherwise I'd say file encoding issue.
I guess you see �, the replacement character for invalid UTF-8 byte sequences. Your text is not UTF-8 encoded. Check your editor’s settings to see which encoding the PHP file is saved in.
If you’re not sure about the encoding of your sources, you can enforce UTF-8 compatibility as described here (German text): Force UTF-8.
You should never need entities except the basic ones.
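To check from PHP whether a given string really is UTF-8, one option is mb_check_encoding. This is only a sketch, assuming the mbstring extension is loaded; the helper name is made up for illustration.

```php
<?php
// Hypothetical helper: true when the bytes form a valid UTF-8 sequence.
function isValidUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

// "é" saved as UTF-8 is the two bytes C3 A9.
$utf8 = "Caf\xC3\xA9";

// "é" saved as ISO-8859-1 is the single byte E9, which is invalid as UTF-8.
$latin1 = "Caf\xE9";

var_dump(isValidUtf8($utf8));
var_dump(isValidUtf8($latin1));
```

A file saved in a single-byte encoding but served as UTF-8 falls into the second case, which is when the browser shows the replacement character.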

Arabic characters corrupt on landing, fine after refresh - UTF8

I have a PHP page with mixed Latin and Arabic characters. The charset declaration tag is in the HTML code
and the file is saved as UTF-8. All the text is static and in the php file (does not come from a DB or an external source)
When I browse to the site some pages randomly get corrupt in IE and FF and display all question marks. After I refresh the page, text is displayed properly though... I have been working with Arabic and Hebrew for a long time and this is the first time I run in to this issue. Can anybody think of a cause?
Chrome is always fine...
Turns out the script reference that came before the meta charset tag was causing the problem. I moved
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
to be the first item after the opening head tag and this is no longer an issue. Thanks for all the comments..
P.S. I wasn't the one who coded this page; I'm only working on localizing it. That's why I didn't even think that the meta tag being after a script would make a difference...
Try to send appropriate header, something like this:
header("Content-Type: text/html; charset=utf-8");
Try using UTF8_encode on your content:
http://php.net/manual/en/function.utf8-encode.php
If you have some text you want to store in a DB and display even if the page encoding is latin-1, there is a free tool that can convert Unicode to escaped HTML:
http://www.sprawk.com/tools/escapeUnicode
