PHP parser grabbing ISO website, to be converted to UTF8 - php

I am parsing some content from another website which, as I can see in the header, is ISO-8859-1. But the CMS that is pulling in the content uses UTF-8.
Most characters come through, but things like " turn into weird characters. I'm not sure how to convert this content properly.
Can anyone help, please?

Use utf8_encode($content) to convert the ISO-8859-1 content to UTF-8 before saving it to the database, and utf8_decode($content) if you need to convert it back to ISO-8859-1 on the way out.
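As a minimal sketch of that conversion, the snippet below uses iconv() rather than utf8_encode(), since utf8_encode() is deprecated as of PHP 8.2. The input strings are made up for illustration. The second conversion assumes the page is really Windows-1252 despite its ISO-8859-1 label, which is a common reason curly quotes come out wrong (they sit at bytes 0x93/0x94, which are control characters in ISO-8859-1).

```php
<?php
// Hypothetical input: the bytes of "café" as an ISO-8859-1 page would send them.
$latin1 = "caf\xE9";
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// Hypothetical input: curly double quotes as Windows-1252 bytes 0x93 and 0x94.
$cp1252 = "\x93quoted\x94";
$quotes = iconv('CP1252', 'UTF-8', $cp1252);

echo $utf8, "\n";
echo $quotes, "\n";
```

If the curly quotes only come out right with the CP1252 source encoding, the remote site is most likely mislabelled.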

Related

Characters appearing different from API

I am trying to work with the Amazon Associates API, but whenever I try to get the information for a product, the characters come out in a weird way.
Example
Text on the Amazon page:
🔥 【23800 mAh
Output of the JSON from the API: 🔥 ã€23800 mAh
More weird characters appear in the same way, such as a dash turning into a question mark.
I used a PHP code snippet they provided, which contains the following line setting the charset:
$awsv4->addHeader('content-type', 'application/json; charset=utf-8');
Does anyone have a pointer where I might be going wrong here, and what I could do to fix this weird conversion?
I don't get what you are actually going to do with this. Assuming you are going to keep it in a database, I am going to give you a solution.
You said these characters are turning into weird symbols. What is happening here is that the emojis/symbols arrive as UTF-8 encoded data, and you are seeing the raw UTF-8 bytes rendered as symbols.
Now if you want to keep that data, you have to store it in some text encoding. It doesn't have to be UTF-8; many encodings are available.
If you want to decode it, you have to display it in a system that can show it normally. For example, I stored emojis in my database as UTF-8 many years ago. The mobile phone I was using gave me an encoding for one emoji. I saved it, and the next time I looked at the data in my PC browser I saw some other symbols. The right decoding must be available wherever you view the data later.
The point is that you cannot save the data exactly as you see it on Amazon.
This line
$awsv4->addHeader('content-type', 'application/json; charset=utf-8');
says that the request you are sending is UTF-8 encoded JSON. It does not ask for a specific encoding for the response.
However, the response you are getting is UTF-8 encoded.
You read 🔥 ã€23800 mAh because you are displaying the received data using an encoding other than UTF-8.
A more detailed answer can be given if we see how you are writing the output and in which context (web page, terminal...).
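The effect described above can be reproduced with a short sketch. It assumes the display side is treating the bytes as Windows-1252 (a guess based on the "ã€" output) and uses an en dash as the sample character. The variable names are illustrative only.

```php
<?php
// An en dash (U+2013) as the UTF-8 bytes the API would send.
$fromApi = "\xE2\x80\x93";

// Simulate a viewer that wrongly assumes Windows-1252:
// each of the three bytes is turned into its own character.
$misread = iconv('CP1252', 'UTF-8', $fromApi);
echo $misread, "\n";

// Undoing the wrong step recovers the original bytes, which shows
// the data itself was never damaged, only misinterpreted.
$repaired = iconv('UTF-8', 'CP1252', $misread);
echo $repaired, "\n";
```

In practice the fix is not to convert anything but to make the output side declare UTF-8, for example with a charset in the Content-Type header of the page that prints the data.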

Utf8 in html correct and php html output messed up

This issue is mind-boggling to me. I am facing the following situation: I wrote a website in HTML using the UTF-8 charset, and special characters are displayed as expected. Now I want to output some PHP/MySQL results, so the easiest way is to create a PHP file, include the HTML code, and then print the results. However, the HTML output via the PHP file does not display the special characters correctly... it's not UTF-8.
here is the html version: HTML
and here the exact copy in a php file: HTML VIA PHP
To close this question myself (because I feel rather stupid right now), the one who actually solved this is Marc B as his comments made me understand the process of text encoding.
After setting the header (Content-Type and charset) as well as the meta tag in the HTML, I discovered, just as Marc suspected, that my IDE had saved the PHP file in an encoding other than UTF-8. Saving the file as UTF-8 and replacing the messed-up special characters fixed my issue.
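One way to check whether a source file was saved as UTF-8 is to test its bytes for validity. This is a minimal sketch: looksLikeUtf8() is a hypothetical helper name, and it relies on preg_match() with the /u modifier rejecting invalid UTF-8 subjects, so it does not need mbstring.

```php
<?php
// Hypothetical helper: true if the bytes form valid UTF-8.
function looksLikeUtf8(string $bytes): bool
{
    // With the /u modifier, preg_match() returns false on invalid UTF-8.
    return preg_match('//u', $bytes) === 1;
}

// "é" saved as UTF-8 (two bytes) versus ISO-8859-1 (one byte).
var_dump(looksLikeUtf8("caf\xC3\xA9"));
var_dump(looksLikeUtf8("caf\xE9"));

// Usage on a real file, assuming the path exists:
// var_dump(looksLikeUtf8(file_get_contents('page.php')));
```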
Please excuse this, I wasn't fully aware of what I was doing.

html content into a page

I need to pull content from the database onto the page, but some of it contains a whole HTML page, with CSS, head, etc.
What would be the best way to avoid pulling in all the HTML tags, scripts, and CSS? Would an iframe help here?
The most bothering thing is that I'm getting strange characters on the page: �
and as found out it is due to different encoding.
The site has utf-8 encoding and if the content contains different encoding, these signs come out and I cannot replace them.
The only thing that made them disappear was changing my encoding, but that is not a real solution.
If someone could tell me how to remove them, that would be great.
Solution: with your help I checked the encoding, but couldn't change it. I set names in mysql_query to UTF-8 and stripped the unnecessary tags. Now it seems OK.
Thanks to all of you.
I think you have no option apart from an ugly iframe. As for encoding, you should check the DB encoding and the connection encoding, and convert as needed. Use iconv for full control over the conversion, for example:
$html = iconv("UTF-8", "ISO-8859-15//TRANSLIT//IGNORE", $html);
In this case, you're going to lose some characters not mapped in ISO-8859-15. Consider moving your whole site to UTF-8 encoding.
The � characters might in fact not be due to encoding; the problem might be the content that is stored in the database.
Check for double quotes like “ which are supposed to be ", more so if the data in the table was copy pasted.
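If pasted curly quotes turn out to be the culprit, they can be normalised to plain ASCII quotes. The following is only a sketch: straightenQuotes() is a hypothetical helper, and it assumes the text is already valid UTF-8.

```php
<?php
// Hypothetical helper: replaces curly quotes with their ASCII equivalents.
// Assumes $text is UTF-8 encoded.
function straightenQuotes(string $text): string
{
    return str_replace(
        ["\u{201C}", "\u{201D}", "\u{2018}", "\u{2019}"],
        ['"', '"', "'", "'"],
        $text
    );
}

echo straightenQuotes("\u{201C}Hello\u{201D}"), "\n";
```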

How to get rid of � using php

I am pulling comments out of the database and this, �, shows up... how do I get rid of it? Is it because of what's in the database or how I'm showing it? I've tried using htmlspecialchars but it doesn't work.
Please help
The problem lies with character encoding. If the character shows up fine in the database but not on the page, your page needs to be set to the same character encoding as the database. And vice versa: if the character encoding of the page that posts to the database does not match, the data comes out weird.
I generally set my character encoding to UTF-8 for any type of posting fields, such as comments and posts. Most MySQL databases default to the latin1 charset, so you will need to modify that: http://yoonkit.blogspot.com/2006/03/mysql-charset-from-latin1-to-utf8.html
The HTML part can be done with a META tag: <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
or with PHP: header('Content-type: text/html; charset=utf-8'); (must be placed before any output.)
Hopefully that gets the ball rolling for you.
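Putting the header and the meta tag together might look like the sketch below. renderCommentPage() is a hypothetical helper, and the ENT_SUBSTITUTE flag is an addition not mentioned above: it makes htmlspecialchars() replace invalid byte sequences with U+FFFD rather than return an empty string.

```php
<?php
// Hypothetical helper: wraps a comment in a minimal UTF-8 HTML page.
function renderCommentPage(string $comment): string
{
    // Declare the encoding in the HTTP header; this must run before any output.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }

    // Escape for HTML, telling PHP the input is UTF-8.
    $safe = htmlspecialchars($comment, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    return "<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"utf-8\">\n"
        . "<title>Comments</title>\n</head>\n<body>\n<p>{$safe}</p>\n</body>\n</html>\n";
}

echo renderCommentPage("Fish & chips <b>");
```

The header and the meta tag should name the same encoding, and both only help if the bytes being printed really are UTF-8.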
That happens when you have a character that your font doesn't know how to display. It shows up differently in every program: many Windows programs show it as a box, Firefox shows it as a question mark in a diamond, and other programs just use a plain question mark.
So you can use a newer display system, install a missing font (for example, if it's Asian characters), or check whether it's only one or two characters that do this and just replace them with something visible.
It might be a problem with the way you are storing the information in the database. If the encoding you were using didn't accept accented characters (à, ñ, î, ç...), then it stores them as weird symbols. The same happens to other language-specific symbols. There is probably no fix for what's already in the database, but you can still get future inserts right by changing the encoding in MySQL.
Cheers
Make sure your database is UTF-8 (if that doesn't solve the problem, make sure you specify your charset when connecting to the database).
You can also encode / decode before entering data into your database.
I would suggest using htmlspecialchars() for encoding and htmlspecialchars_decode() for decoding.
Are you setting your charset with mysql_set_charset() after mysql_connect()?
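The mysql_* functions mentioned above were removed in PHP 7, so a current equivalent of setting the connection charset is sketched here with PDO. buildMysqlDsn() is a hypothetical helper, and the host and database names are placeholders. utf8mb4 is MySQL's charset for full UTF-8, including emoji.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that sets the connection charset.
function buildMysqlDsn(string $host, string $dbName, string $charset = 'utf8mb4'): string
{
    return "mysql:host={$host};dbname={$dbName};charset={$charset}";
}

$dsn = buildMysqlDsn('localhost', 'example_db');
echo $dsn, "\n";

// Usage, assuming valid credentials and the pdo_mysql extension:
// $pdo = new PDO($dsn, $user, $password);
```

With mysqli, the corresponding call would be $mysqli->set_charset('utf8mb4') right after connecting.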
As others have said, check what your database encoding is. You could try using utf8_encode() or iconv() to convert your character encoding.
Check your code for errors. That's all one can really say considering that you have given us absolutely no details as to what you're doing.
Encoding problems are usually what cause that (are you converting from integers to characters?), so, you fix it by checking if you're converting things properly.

Why is PHP's utf8_encode breaking my utf-8 string?

I'm doing a kind of roundabout experiment thing where I'm pulling data from tables in a remote page to turn it into an ICS so that I can find out when this sports team is playing (because I can't find anywhere that the information is more readily available than in this table), but that's just to give you some context.
I pull this data using cURL and parse it using domDocument. Then I take it and parse it for the info I need. What's giving me trouble is the opposing team. When I display the data on the initial PHP page, it's correct. But when I write to an ICS file, special UTF-8 characters get messed up. I thought utf8_encode would solve that problem, but it actually seems to have the opposite effect: when I run the function on my data, even the stuff displayed on the page (which had been displaying correctly), not in the separate ICS file (which was writing incorrectly), is incorrect. As an example: it turns "Inđija" to "InÄija."
Any tips or resources as far as dealing with UTF-8 strings in PHP? My server (a remote host) doesn't have mbstring installed either, which is a pain.
utf8_encode encodes a string in ISO 8859-1 as UTF-8. If you put UTF-8 into it, it's going to interpret it as if it was ISO 8859-1, and hence produce mojibake.
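The double encoding can be shown with a short sketch. It uses iconv('ISO-8859-1', 'UTF-8', ...) as a stand-in for utf8_encode(), which performs the same conversion and is deprecated as of PHP 8.2. toUtf8Once() is a hypothetical guard, not a built-in, and its validity check is only a heuristic.

```php
<?php
// "Inđija" already encoded as UTF-8: đ (U+0111) is the two bytes C4 91.
$alreadyUtf8 = "In\xC4\x91ija";

// The conversion utf8_encode() performs: every byte is treated as one
// ISO-8859-1 character, so the two bytes of đ become two separate characters.
$doubleEncoded = iconv('ISO-8859-1', 'UTF-8', $alreadyUtf8);

// Hypothetical guard: convert only when the input is not already valid UTF-8.
// preg_match() with /u fails on invalid UTF-8, so no mbstring is needed.
function toUtf8Once(string $s): string
{
    return preg_match('//u', $s) === 1 ? $s : iconv('ISO-8859-1', 'UTF-8', $s);
}

var_dump($doubleEncoded === $alreadyUtf8);
var_dump(toUtf8Once($alreadyUtf8) === $alreadyUtf8);
```

The first byte, C4, is "Ä" in ISO-8859-1, and the second, 91, is an invisible control character, which matches the "InÄija" seen in the question.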
To help with your first problem, before this, I'd want to know what sort of "special" characters are being messed up in the original problem, and what way are they being messed up?
