I am working with the Facebook Public Search API.
As you might expect, the results I get come from many different parts of the world.
I need to convert all the text I receive to the same encoding before storing it in MongoDB. I want to use UTF-8 as the common encoding.
This is an example of what I may get from Facebook:
10 ผู้นำที่โลà¸à¹„ม่ปรารถนา หาà¸à¹„ม่มีผู้นำประเภทนี้à¹à¸¥à¹‰à¸§à¹‚ลà¸à¹€à¸£à¸²à¸à¹‡à¸ˆà¸°à¸”ีขึ้นเยà¸à¸° โดยไทยติดà¸à¸±à¸™à¸”ับ 1 à¸à¹ˆà¸²à¸™à¸•à¹ˆà¸à¹„ด้ที่นี่
or
Now he says he’d side with Pakistan if there were a conflict with the U.S. Better than the Taliban for sure, but not by much. The poor people of Afghanistan… Ayman al-Zawahiri: Al-Qaeda’
or
€™esercito tedesco, il primo modello di A400M è in fase di collaudo, e ci resterà per tre anni
Is there a function in PHP that can quickly convert the text to UTF-8?
Did you try this function?
http://php.net/manual/en/function.utf8-encode.php
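One caveat: utf8_encode assumes its input is ISO-8859-1, so running it on text that is already UTF-8 double-encodes it, and the samples in the question appear to be UTF-8 read as a single-byte charset. A minimal sketch of a safer conversion, assuming the mbstring extension is available, is shown here. The helper name `to_utf8` and the ISO-8859-1 fallback are illustrative assumptions, not part of any API.

```php
<?php
// Hypothetical helper: return $text as UTF-8.
// Assumption: anything that is not valid UTF-8 is in $fallback (ISO-8859-1 here).
function to_utf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    // Valid UTF-8 is returned untouched so it is never double-encoded.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise convert from the assumed legacy encoding.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

$latin1 = "caf\xE9";       // "café" in ISO-8859-1
$utf8   = "caf\xC3\xA9";   // "café" in UTF-8
var_dump(to_utf8($latin1) === $utf8); // converted
var_dump(to_utf8($utf8) === $utf8);   // left alone
```

This does not repair text that was already double-encoded before it reached you; it only avoids making things worse.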
This problem might not occur in English, but it really hurts in Polish. I guess my question is mostly for Polish users, since they might already have a decent solution.
What I mean is that past-tense verbs in Polish differ for male and female subjects, and there are dozens of different endings. If my script needs to display lots and lots of text, it becomes a really painful problem to deal with. A short example (not very elegant language, but good enough for demonstration):
Male: On poszedł i nie znalazł, więc klasnął w dłonie i nagle go coś pożarło.
Female: Ona poszła i nie znalazła, więc klasnęła w dłonie i nagle ją coś pożarło.
I managed to come up with this solution: at the beginning of each script, I prepare a variable that looks like this:
$verb[$ending][$sex] = 'something';
//$ending holds - for my convenience - letters that describe which ending I am changing, instead of numeric options
//Examples:
$verb['-a']['male'] = '';
$verb['-a']['female'] = 'a';
//works for On=>Ona, znalazł=>znalazła
$verb['al-ela']['male'] = 'ął';
$verb['al-ela']['female'] = 'ęła';
//works for klasnął=>klasnęła
Now add the fact that 99% of the time I don't know in advance which sex I am dealing with, and my variable starts to look kind of scary: $verb['al-ela'][$_SESSION['user'.$id]['sex']]. So my final text looks like this:
O'.$verb['-a'][$_SESSION['user'.$id]['sex']].' posz'.$verb['edl-la'][$_SESSION['user'.$id]['sex']].' i nie znalazł'.$verb['-a'][$_SESSION['user'.$id]['sex']].', więc klasn'.$verb['al-ela'][$_SESSION['user'.$id]['sex']].' w dłonie i nagle '.$verb['go-ja'][$_SESSION['user'.$id]['sex']].' coś pożarło.
Yes, sure, this is a rather extreme example, but sometimes the text really does look like that and it is unavoidable.
To make a long story short, here are my questions:
Am I doing it wrong? Is there a better/faster/handier solution for this type of problem?
Is there a script that can detect/change endings for me without ruining the rest of the text?
I struggled to find a full list of possible ending variations in Polish (for both singular and plural), so I'm building my own list as I find new options. Perhaps someone has a list like that; it might help me create the script from my 2nd question.
Thanks a lot in advance, best regards!
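One way to keep such strings readable is to put both variants inline in the text and pick one at render time. The sketch below is only an illustration under stated assumptions: the function name `gender_text` and the `{male|female}` placeholder syntax are made up here, and it does not detect endings automatically.

```php
<?php
// Hypothetical helper: replace every {male|female} placeholder with the
// variant matching $sex ('male' or 'female'). Either side may be empty.
function gender_text(string $template, string $sex): string
{
    return preg_replace_callback(
        '/\{([^{}|]*)\|([^{}|]*)\}/u',
        function (array $m) use ($sex): string {
            return $sex === 'female' ? $m[2] : $m[1];
        },
        $template
    );
}

$template = '{On|Ona} posz{edł|ła} i nie znalazł{|a}, więc klasn{ął|ęła} w dłonie i nagle {go|ją} coś pożarło.';

echo gender_text($template, 'male'), "\n";
echo gender_text($template, 'female'), "\n";
```

The session lookup then happens once, for example `gender_text($template, $_SESSION['user'.$id]['sex'])`, instead of once per ending.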
I've tried to solve an issue with character encoding for many days now without finding any solution.
Here's what's happening:
I have a form in a page.
When I copy and paste text from Adobe Reader into this form, everything goes fine.
When I copy and paste text from Preview (the Mac OS image viewer), it turns into strange characters.
When the form is submitted, the sentence:
salade mêlée, tomates, mozzarella, basilic melon en saison et jambon cru
Goes through an ajax function and I can see in firebug:
salade%20me%CC%82le%CC%81e%2C%20tomates%2C%20mozzarella%2C%20basilic%20melon%20en%20saison%20et%20jambon%20cru
Now when I receive this value in my Zend controller, in order to save it to my database, I run into the following cases:
If I iconv it to cp1252, the text is cut to "salade me" and that's it.
If I utf8_encode it, it turns into: salade meÌleÌe, tomates, mozzarella, basilic melon en saison et jambon cru
If I utf8_decode it, it becomes: salade me?le?e, tomates, mozzarella, basilic melon en saison et jambon cru
If I do no transformation, it works... but in phpMyAdmin I see: salade mêlée, tomates, mozzarella, basilic melon en saison et jambon cru
Any ideas? This is driving me crazy!!
Thanks!
Make sure that phpMyAdmin is configured to use UTF-8, and that the database is also using UTF-8, as well as the connection between PHP and the database. If all of them are using UTF-8, then you should have no issues passing UTF-8 back and forth.
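A further detail worth checking: the `%CC%82` and `%CC%81` sequences in the request are the UTF-8 bytes of combining circumflex and acute accents, so the pasted text is valid UTF-8 in decomposed form (e followed by an accent) rather than using the precomposed ê and é. A minimal sketch of recomposing it before saving, assuming PHP 7+ and the intl extension (which provides the `Normalizer` class):

```php
<?php
// Assumption: the intl extension is installed, so Normalizer is available.
// The input is the start of the percent-encoded value seen in Firebug.
$raw = rawurldecode('salade%20me%CC%82le%CC%81e');

// Recompose "e + combining accent" pairs into single precomposed characters (NFC).
$composed = Normalizer::normalize($raw, Normalizer::FORM_C);

var_dump(strlen($raw));      // byte length of the decomposed form
var_dump(strlen($composed)); // byte length after composition (shorter)
echo $composed, "\n";
```

Normalizing only changes how the accents are represented; the database, the connection and phpMyAdmin still all need to agree on UTF-8 as described above.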
The strange encoding occurs when the article includes letters like "ş, ç, ö, İ". How can I fix this?
Kesme \u015fekere benzeyen, kire\u00e7 beyaz\u0131 evleriyle kar\u015f\u0131l\u0131yor BODRUM bizi..T\u0131pk\u0131, \u00e7ocuklu\u011fumuzdaki gibi.. Pencerelerdeki mavi \u00e7izgiler, denizin g\u00fcl\u00fcmsemesi adeta.. Ve denizden esen ilk r\u00fczgar bir\u00e7ok kokuyla ho\u015fgeldin diyor bize..Eski sevgililer, aile, dostluklar, an\u0131lar.. Film \u015feridi gibi ge\u00e7iyor \u00f6n\u00fcm\u00fczden
Are you sending this data in JSON? If so, that's standard encoding for Unicode characters, and you can simply run it through json_decode to decode them.
If not, more information will be needed to help you.
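As a small illustration of the json_decode suggestion, here is a sketch that decodes a fragment of the text from the question. Wrapping the fragment in double quotes to form a JSON string is an assumption that only holds when the text contains no unescaped quotes or control characters; if the data arrives as a full JSON document, decode that document directly instead.

```php
<?php
// Single quotes keep the \uXXXX escapes literal, as they arrive over the wire.
$escaped = 'Kesme \u015fekere benzeyen, kire\u00e7 beyaz\u0131 evleriyle';

// Wrap the fragment in quotes so it is a valid JSON string, then decode it.
$decoded = json_decode('"' . $escaped . '"');

if ($decoded === null) {
    // json_decode returns null when the input is not valid JSON.
    echo 'Decode failed: ', json_last_error_msg(), "\n";
} else {
    echo $decoded, "\n";
}
```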
I get this wiki page from the API: http://fr.wikipedia.org/w/api.php?action=query&titles=%C9rythropo%EF%E9tine&prop=revisions&rvprop=content&format=xmlfm
and I would like to retrieve the main content, starting from:
L''''érythropoïétine''' ('''EPO''') est une [[hormone]] ......etc
As a start, I tried to preg_replace everything from the top, from the word "{{Chimiebox..." down to the closing "}}", using this:
preg_replace( '/^{{(.*)}}$/sim', '', $value[0]['*'] );
But it doesn't really work. Does anyone know a good way to determine where the content starts? Thanks for any advice.
Well, afaik most projects use the Wikipedia parser directly, e.g. the Wikipedia Offline Client Project at my university. Since you seem to be using PHP, this may be the easiest way for you.
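If pulling in the full parser is too heavy, a single regex struggles here because templates such as {{Chimiebox}} can contain nested {{...}} pairs. A minimal sketch that skips one leading template by counting brace depth follows; the function name `strip_leading_template` is hypothetical, and it ignores other wikitext constructs (comments, tables, triple braces).

```php
<?php
// Hypothetical helper: drop one leading {{...}} template, including any
// nested templates, and return the remaining wikitext.
function strip_leading_template(string $wikitext): string
{
    $text = ltrim($wikitext);
    if (substr($text, 0, 2) !== '{{') {
        return $text; // nothing to strip
    }
    $depth = 0;
    $len = strlen($text);
    for ($i = 0; $i < $len - 1; $i++) {
        $pair = substr($text, $i, 2);
        if ($pair === '{{') {
            $depth++;
            $i++; // skip the second brace of the pair
        } elseif ($pair === '}}') {
            $depth--;
            $i++;
            if ($depth === 0) {
                // The leading template is closed; the rest is the content.
                return ltrim(substr($text, $i + 1));
            }
        }
    }
    return $text; // unbalanced braces: leave the input unchanged
}

$sample = "{{Chimiebox\n| nom = {{exemple}}\n}}\nL''''érythropoïétine''' ('''EPO''') est une [[hormone]]";
echo strip_leading_template($sample), "\n";
```

With the code from the question this would be called as `strip_leading_template($value[0]['*'])`.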
I have an unusual problem (this is linked to Browser displays � instead of ´)
I had mismatched character encoding settings on my server (UTF-8) and application (ISO-8859-1), so when a third person tasked with entering Spanish translations entered the words properly at his end, they weren't saved correctly in the database.
I have subsequently fixed the problem and the server is now ISO-8859-1 as well. [I set
default_charset = "iso-8859-1"
in php.ini]
I do see a pattern in what is in the system, for example the following appears on the system:
Nombre de la organización*
This needs to be:
Nombre de la organización*
ie, I need to search and replace 'ó' with 'ó'.
How can I do so for an entire table (all fields)? (there will be other such corrections as well)
Use the replace function. Simple example:
SELECT REPLACE('www.mysql.com', 'w', 'Ww');
Result: 'WwWwWw.mysql.com'
Now, if you have a table called Foo and you want to replace those characters in a field called bar, you can do the following:
update Foo set bar = Replace(bar, 'ó', 'ó');
Do this for all the affected fields and the problem is solved.
Best regards,
Lajos Arpad.
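Since the same UPDATE has to be repeated for every affected field and every broken character, the statements can be generated. The sketch below only builds the SQL and its parameters; the function name `build_fix_queries`, the table `Foo` and the column list are placeholders, and running the result through a PDO or mysqli prepared statement is left to the caller. Back up the table first.

```php
<?php
// Hypothetical helper: build one parameterised UPDATE per column and per
// (broken => correct) pair. Table and column names must come from trusted
// code, because identifiers cannot be bound as parameters.
function build_fix_queries(string $table, array $columns, array $pairs): array
{
    $queries = [];
    foreach ($columns as $column) {
        foreach ($pairs as $bad => $good) {
            $queries[] = [
                sprintf('UPDATE `%s` SET `%s` = REPLACE(`%s`, ?, ?)', $table, $column, $column),
                [$bad, $good],
            ];
        }
    }
    return $queries;
}

// Example: the single correction mentioned in the question, on one column.
$queries = build_fix_queries('Foo', ['bar'], ['ó' => 'ó']);
foreach ($queries as [$sql, $params]) {
    echo $sql, '  -- params: ', implode(', ', $params), "\n";
}
```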