I'm using Sphider as a search engine for my website. It's really easy to work with, but I'm having some major issues with localized characters.
All of my HTML/PHP pages declare the charset as UTF-8, while Sphider's search and result pages had charset=ISO-8859-1. When I first used the Sphider "spider" to crawl my website, it turned all of my localized characters into some encoding I don't recognize:
"ç" becomes "Ã§", and so on with "ã", "á", etc.
When I created the DB in MySQL I gave it the utf8_general_ci collation. My other settings for the DB are:
MySQL charset: UTF-8 Unicode (utf8)
MySQL connection collation: utf8_unicode_ci
This is a real problem because the search won't work properly. If I search for "diferença", for instance, the URL shows "?query=diferença&search=1", which is correct, but the search produces no results. In the "suggested search" it appears as "diferen�a"; in case that is not visible, the "ç" has become a black square with a white question mark on it.
I believe the spider might be working in a different charset, but I can't work out where that would be set, if that is the case. Also, since it was developed primarily for English, it's not hard to see why it has some hiccups along the way.
Does anyone have any experience with it, or know what I should try to solve this?
What's really bugging me is not understanding why I get strange symbols in the DB.
Quickly browsing through some Sphider source code files revealed that the application works only with Latin1 charset. You should switch to some other search engine, like Lucene. You'll need to do a bit more search-related coding though. If you don't feel like doing it, and your site is public, just integrate Google search.
You should have EVERYTHING in utf-8.
The forms that edit any given page
The physical files
The outputted html files
The headers
The connection to the database
The table definition
Miss one and you will have problems (I'm talking from personal experience)
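What "missing one" looks like can be sketched in a few lines. This is only an illustration, assuming the mbstring extension is available: it takes the UTF-8 bytes of "ç", treats them as ISO-8859-1 the way a Latin-1-only component would, and shows the double-encoded result from the question.

```php
<?php
// "ç" encoded as UTF-8 is the two bytes 0xC3 0xA7.
$utf8 = "\xC3\xA7";

// A component that assumes ISO-8859-1 sees two characters instead
// (0xC3 = "Ã", 0xA7 = "§") and re-encodes each of them to UTF-8.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";     // c3a7
echo bin2hex($misread), "\n";  // c383c2a7, which a UTF-8 page renders as "Ã§"
```

The same mechanism explains "ã" and "á": every non-ASCII character grows from two bytes to four once it has been through a single Latin-1 step.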
Modify the line 4 of file "header.html" in appropriate template directory to <meta http-equiv="content-type" content="text/html; charset=UTF-8">
Convert the appropriate php file in "languages" directory to UTF8.
If the above doesn't suffice, follow the answer by The Disintegrator as well.
This question already has answers here:
UTF-8 all the way through
Tried searching for this question but I think I don't know the jargon. I am entering my site content into a mysql database using php, but all of my accented letters or apostrophes (spanish) get transformed into some crazy encoding. Ex:
’ becomes â€™
á becomes Ã¡
and so on. First off, I don't know what this is or what it means, but when I display them on my site they definitely do not revert back. Nor would I expect them to, because if I manually enter UTF-8 letters they work fine on my site.
Is there a way to fix this without re-entering all of my text? I have a feeling I can extract them using php, decode them and then insert them back in but I do not know the functions that do this. The best solution, and if anyone knows how that would be amazing, would be to just do it within sql. By the way the columns say collation utf8_general_ci.
For some further information, I am not doing anything to the text that gets entered into the database (I know that is bad but I suck at this stuff!) Also I am not doing anything when it is being queried. My functions insert pure text and extract pure text to each page. In this way I can write html into my forms and it appears as html on the page and therefore the browser interprets it correctly. I also have a feeling this is really sloppy but like I said... Thanks for all the help!
-- Edit --
So thanks to the people who pointed me to the other questions. However, the way I fixed it was not in that answer; it just gave me the right keywords to start a new search. For anyone who has this problem, the way I fixed it was using the function utf8_decode(). I'm not sure this is a great fix, but at least it is working for now and speed was my biggest priority. I am certain the core problem is in how I am entering the data into the database.
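For anyone wanting to repair already stored text in PHP, here is a rough sketch of the same idea as utf8_decode(), which is deprecated as of PHP 8.2. It assumes the damage is exactly one round of "UTF-8 read as Windows-1252" (that matches the â€™ and Ã¡ samples above) and that mbstring is installed. The helper name undo_double_encoding is made up for this example.

```php
<?php
/**
 * Undo one round of "UTF-8 bytes read as Windows-1252" double encoding.
 * Hypothetical helper; returns the input untouched when it does not
 * look double-encoded.
 */
function undo_double_encoding(string $text): string
{
    // Map every character back to the single byte it was misread from.
    $bytes = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

    // Only trust the result if nothing was lost on the way...
    $lossless = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252') === $text;

    // ...and the recovered bytes form valid UTF-8.
    if ($lossless && mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return $text;
}

echo undo_double_encoding("\xC3\x83\xC2\xA1"), "\n"; // "Ã¡" comes back as "á"
```

The two checks are there so that text which was never broken (a plain "á", or characters Windows-1252 cannot represent) is left alone rather than mangled. Try it on a copy of the data first.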
You can make sure the character encoding for INSERT queries is correct at runtime using $mysqli->set_charset("utf8")
It's also prudent to make sure that PHP is sending the correct HTTP response headers, and the Browser is doing the right thing:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
ini_set('default_charset', 'utf-8')
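To alter your tables, do ALTER TABLE myTable CHARACTER SET utf8 COLLATE utf8_general_ci; Note that this only changes the table's default for new columns. To convert the existing text columns as well, use ALTER TABLE myTable CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;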
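Put together, the connection side of this might look like the sketch below. The function name and its parameters are placeholders for illustration, not anything from the question's code. On MySQL 5.5.3 and later you may prefer 'utf8mb4', since MySQL's utf8 cannot store characters outside the Basic Multilingual Plane.

```php
<?php
/**
 * Open a mysqli connection that talks utf8 to the server.
 * Hypothetical helper; host, credentials and database name are placeholders.
 */
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $mysqli = new mysqli($host, $user, $pass, $db);
    if ($mysqli->connect_errno) {
        throw new RuntimeException('Connect failed: ' . $mysqli->connect_error);
    }

    // Sets the connection charset like SET NAMES does, and also tells
    // mysqli about it so real_escape_string() escapes correctly.
    if (!$mysqli->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $mysqli->error);
    }
    return $mysqli;
}
```

Call it once right after connecting, before any INSERT or SELECT runs.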
I saw a website like this, i.e. http://www.a3malcom.com/index.php. I want to build the same kind of website in Arabic. Do the entries in the database table also need to be in Arabic, or in English?
What if I need the website in two languages, i.e. English and Arabic? In what language should the data be entered in the DB?
Take a look at comprehensive article: (Thanks to #Deceze for great article)
Handling Unicode Front To Back In A Web App
It also has Arabic example with other languages:
Yes, you should insert the data in Arabic in the DB table. That way you can read it directly in the web page with no need to convert.
And use UTF-8 encoding when displaying the page.
Usually unicode is all that is needed (UTF-8).
Your source file should be encoded UTF-8 if you want to write arabic in the PHP source. Note that some text editors don't support arabic properly.
For the database, just create your database with UTF-8 encoding.
For HTML output, add this to your HEAD section:
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
Alternatively, you can send it as HTTP header... but why bother.
Anyway, I have encountered a problem generating Arabic sentences as images in GD using a custom font. It turns out GD doesn't render Arabic properly (at least in my case).
That was solved using the following library:
http://ar-php.org/
(yes the website is pretty ugly, but the library works, is well packaged, contains documentation...)
All I had to do, in my case, to fix the problem, is:
$Arabic = new I18N_Arabic('Glyphs');
$text = $Arabic->utf8Glyphs($_GET['txt']);
And then feed $text to GD.
I'm encountering a minor problem with text direction, though, and looking for a solution. But at least I have valid arabic now.
But in most cases you won't need that library, but will just need to make sure that you're using UTF-8 in all your development process.
Hope it helps.
I'm trying to figure this out but I'm quite puzzled at the mo.
I have a directory in my website containing pdf files with greek filenames (ie ΤΙΜΟΚΑΤΑΛΟΓΟΣ.pdf)
I want to have links for the files on a web page so that users can open or save the files.
So far I can list the files ok but if I click on them I get a 404 error. It's as if the server thinks they're not there although they are.
I understand it's probably an encoding issue, but beyond that I'm not sure what to look for. The website encoding is UTF-8, and in order to display the filenames correctly I had to use mb_convert_encoding($file->filename, 'utf8', 'iso-8859-7').
This is the url: http://www.med4u.gr/timokatalogoi/
This is the directory listing: http://www.med4u.gr/pricelists/
The site is based on Joomla and it's hosted on a linux server.
Any ideas?
ISO-8859-* MUST DIE! (That's not personal!) Do everything in UTF-8. Everything. With good reason, some of us get upset when we see them being used, especially Latin-1 (8859-1) which bites a lot of people. I think you would find it very helpful to just dump them and move on to UTF-8.
Things to check:
Store your files encoded in UTF-8: Usually no difficulties with that.
Make sure your server is sending the files with UTF-8 charset: add header('Content-Type: text/html;charset=UTF-8'); near the top of your PHP.
Just in case someone saves your page, it's helpful in that case to put the same thing in a <meta> tag in the head.
Check it all in your browser: right click, view page info, and make sure the encoding is right.
CPanel is very flexible, so that's all doable without much fuss. Feel free to comment if you want more detail.
If you have a database, there are a few more hoops to jump through, but it's worth it. With UTF-8 you never have to worry, and it's the definitive, future-proof way of doing things.
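As a small sketch of the header step from the checklist above (the helper name content_type_header is just something made up for this example):

```php
<?php
/** Build a Content-Type header line that names the charset explicitly. */
function content_type_header(string $mime = 'text/html', string $charset = 'UTF-8'): string
{
    return "Content-Type: {$mime}; charset={$charset}";
}

// Headers can only be sent before any output, so do this near the top.
if (!headers_sent()) {
    header(content_type_header());
}
```

The <meta> tag in the head should then name the same charset, so a saved copy of the page still declares its encoding.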
Let's suppose for the sake of argument that the file name on disk is aa.pdf but your conversion displays it as ab.pdf. You need either to revert the conversion so it points back to aa.pdf, or teach the server to remap or redirect requests for ab.pdf to this file. Or if you prefer, rename the file to ab.pdf instead, if your file system can handle this name.
It's definitely an encoding problem. You'll need to escape the URL, or convert it to whatever character set your server recognises.
e.g. 'ΤΙΜΟΚΑΤΑΛΟΓΟΣ LASER.pdf' in iso-8859-7 = 'ÔÉÌÏÊÁÔÁËÏÃÏÓ LASER.pdf' in iso-8859-1
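One way to do the escaping, as a sketch only: it assumes the names on disk really are ISO-8859-7 bytes (which the mb_convert_encoding call in the question suggests) and that mbstring is available. The function name pdf_link and the sample name are made up for illustration. The point is that the link must carry the bytes exactly as they exist on disk, while only the visible label gets converted to UTF-8.

```php
<?php
/**
 * Build the href and the visible label for a file whose name is stored
 * on disk in ISO-8859-7 (an assumption, see above).
 */
function pdf_link(string $nameOnDisk, string $baseUrl): array
{
    return [
        // Percent-encode the raw bytes so the server finds the same name.
        'href'  => rtrim($baseUrl, '/') . '/' . rawurlencode($nameOnDisk),
        // Convert only the text that is shown to the user.
        'label' => mb_convert_encoding($nameOnDisk, 'UTF-8', 'ISO-8859-7'),
    ];
}

// "ΤΙΜΟ.pdf" as ISO-8859-7 bytes, standing in for a real directory entry.
$link = pdf_link("\xD4\xC9\xCC\xCF.pdf", '/pricelists/');
printf(
    '<a href="%s">%s</a>' . "\n",
    htmlspecialchars($link['href'], ENT_QUOTES, 'UTF-8'),
    htmlspecialchars($link['label'], ENT_QUOTES, 'UTF-8')
);
```

If the files are later renamed to UTF-8 on disk, the conversion for the label goes away and rawurlencode() on the UTF-8 name is all that is left.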
I have two tables here. One is in UTF-8 and holds Arabic text that can be read as-is. The other one has a different encoding; its content is also Arabic, but in the database it is displayed as
ÈöÓúãö Çááøåö ÇáÑøóÍúãóäö ÇáÑøóÍöíãö
I have to show data from both tables on the same page. The page is UTF-8 encoded, but I'm not sure whether this can be done. What do I do? My database is MySQL and I'm using PHP.
Is it possible to convert the encoding of the contents of the other table into UTF-8, by the way?
You have to use mb_convert_encoding() first, on everything, to make sure it's all in UTF-8 to begin with. http://us3.php.net/manual/en/function.mb-convert-encoding.php Then it should display, assuming your HTML's charset is UTF-8 and the users have the appropriate fonts installed.
Also, virtually all consoles and a great many free online SQL commanders (like phpMyAdmin) are not UTF-8 aware and print out gibberish. I have not yet found a free SSH client that supports UTF-8; if it's a big deal, invest in SecureCRT.
EDIT:
Excuse me. I don't read Arabic at all, but I did get Arabic back. please tell me if this is the correct text, and if so, accept this answer ;_)
ب?س?ك? افف?م? افر??ح?ك?ل? افر??ح?ٍك?
The code I used to get this was:
header('Content-Type: text/html;charset=utf-8');
echo mb_convert_encoding('ÈöÓúãö Çááøåö ÇáÑøóÍúãóäö ÇáÑøóÍöíãö', 'utf-8', 'iso-8859-6');
I found the Arabic encoding via this page: http://a4esl.org/c/charset.html
Cheers!
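The question marks in the output above hint that ISO-8859-6 may not be the right source encoding. A variation worth trying, and this is a guess rather than something confirmed by the asker, is Windows-1256, which is very common for legacy Arabic data. mbstring does not list that encoding, so the sketch below uses iconv() for the second step. The function name is made up for this example.

```php
<?php
/**
 * Recover Arabic text whose Windows-1256 bytes are being displayed as
 * Latin-1 characters. The Windows-1256 guess is an assumption.
 */
function latin1_view_to_arabic(string $shown): string
{
    // Step 1: get back the raw bytes behind the Latin-1 characters.
    $bytes = mb_convert_encoding($shown, 'ISO-8859-1', 'UTF-8');

    // Step 2: reinterpret those bytes as Windows-1256 and emit UTF-8.
    $arabic = iconv('CP1256', 'UTF-8', $bytes);

    // iconv() returns false on failure; fall back to the input then.
    return $arabic === false ? $shown : $arabic;
}

header('Content-Type: text/html;charset=utf-8');
echo latin1_view_to_arabic('ÈöÓúãö Çááøåö ÇáÑøóÍúãóäö ÇáÑøóÍöíãö');
```

Save the file as UTF-8 for the literal to work. If the bytes come straight from the database over a latin1 connection, step 1 may not be needed and the iconv() call alone could be enough.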
I'm still learning the ropes with PHP & MySQL and I know I'm doing something wrong here with how character sets are set up, but can't quite figure out from reading here and on the web what I should do.
I have a standard LAMP installation with PHP 5, MySQL 5. I set everything up with the defaults. When some of my users input comments to our database some characters show up incorrectly, mostly apostrophes and em dashes at the moment. In MySQL, apostrophes show up as â€™. They display on the page this way also (I'm using htmlentities to output user comments).
In phpMyAdmin it says my MySQL Charset is UTF8-Unicode.
In my database my tables are all set up with the default Latin1-Swedish-ci.
My web pages all have meta http-equiv="Content-Type" content="text/html; charset=utf-8"
When I look at the site's http headers I see: Content-Type: text/html
Like a newbie, I hadn't considered character sets at all until things started looking odd on some of my pages. So does it make most sense for me to convert everything to utf-8 and will this affect my PHP code? Or should I try to get it all into Latin? And do I have to go into the database and replace these odd codes, or will they magically display once I set up the charsets properly? All the fiddling I've done so far hasn't helped (I set the http headers to utf-8, and also tried latin).
If you really want to understand these issues, I would start by reading this article at mysql.com. Basically, you want every piece of the puzzle to expect UTF-8 unicode. On the PHP side, you want to do something like:
<?php header("Content-type: text/html; charset=utf-8");?>
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
And when you run your insert queries you want to make sure both the table's character encoding and the encoding that you're running the queries in are UTF-8. You can accomplish the latter by running the query SET NAMES utf8 right before you run an insert query.
http://www.phpwact.org/php/i18n/charsets
That site gave me a lot of good advice on how to make everything play nice in UTF-8.
I also recommend switching from htmlentities to htmlspecialchars as it is more UTF friendly.
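A minimal sketch of that switch, with the charset passed explicitly (before PHP 5.4 the default for these functions was ISO-8859-1, which is one way UTF-8 comments end up mangled on output). The short helper name e is just a convention used for this example:

```php
<?php
/** Escape user text for HTML output, stating the encoding explicitly. */
function e(string $text): string
{
    // Only the HTML-special characters are escaped; UTF-8 text such as
    // curly apostrophes passes through untouched.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo e("It\xE2\x80\x99s <b>bold</b>"), "\n"; // It’s &lt;b&gt;bold&lt;/b&gt;
```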
The main point is to make sure everything is talking the same language. Your database, your database connection, your PHP, your page is in utf8 (should have a meta tag and a header saying so).
Sorry for not understanding all of your question. But when part of the question is "UTF-8 or not?", the answer is: "UTF-8, of course!"
You definitely want to sort things out now rather than later. One of the most important programming rules is not to keep going with a bad idea - don't dig yourself in any deeper!
Every latin1 character has a utf8 equivalent, so you can convert your tables to utf8 without editing the data by hand. MySQL will transcode it for you when you convert the table.
It's then important to check that everything is speaking utf-8. Set the http headers in apache or use a meta tag - this says to a browser that the HTML output is utf-8.
With this in mind, you need to make sure all of the data you send really is utf-8! Configure your IDE to save php/html files as utf-8. Finally make sure that PHP is using a utf-8 connection to MySQL - issue this query after connecting:
SET NAMES 'utf8';
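To check that the data you send "really is utf-8", something like the following can sit in front of your inserts. It is a sketch that assumes mbstring is installed and that anything failing the check is Windows-1252, which is a guess you should adjust to your own legacy data. The name to_utf8 is made up for this example.

```php
<?php
/**
 * Return $input as UTF-8. Valid UTF-8 is passed through unchanged;
 * anything else is assumed to be in $assumedLegacy and converted.
 */
function to_utf8(string $input, string $assumedLegacy = 'Windows-1252'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    return mb_convert_encoding($input, 'UTF-8', $assumedLegacy);
}

echo bin2hex(to_utf8("caf\xE9")), "\n"; // 636166c3a9
```

Checking first matters: converting text that is already UTF-8 a second time is exactly how the â€™ style garbage gets created.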