Get Unicode character instead of hex - cURL PHP

I am using this scraper for IMDB, and the problem is that some characters come back as hex entities such as &#xEF; instead of the Unicode character.
I use this scraper with cURL, and the response is a string encoded in UTF-8.
I checked the encoding of the string with mb_detect_encoding(), and it reports UTF-8:
$html = $this->geturl("${imdbUrl}combined");
mb_detect_encoding($html);
So I have a string with some hex values inside, for example:
$var = 'Sa&#xEF;d Taghmaoui';
I tried running $html through utf8_decode(), but no luck: I still have some characters in hex.
So I have a few questions:
1- What's the best solution for this? I can imagine different approaches, for example reading the string and using a regex to replace every hex code with its character, but I am not sure that is the best solution, and I also don't know how to write the regex for it.
2- Can this be solved through cURL? I mean, is there some option to set the encoding cURL uses to UTF-8, for example?
I have also tried the functions recode_string, iconv and mb_convert_encoding.

Basically, my problem is that the response from the scraper is UTF-8 encoded, but before printing the text I need to process the data with these functions:
$var = 'Sa&#xEF;d Taghmaoui';
htmlspecialchars(html_entity_decode($var, ENT_QUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8');
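For what it's worth, the hex values look like HTML numeric character references rather than an encoding problem, so neither a regex nor a cURL option should be needed. A minimal sketch, assuming the scraped string contains the entity &#xEF; (the sample value is the one from the question; the names $decoded and $safe are only for illustration):

```php
<?php
// Sample value from the question: the name contains a hex numeric entity.
$var = 'Sa&#xEF;d Taghmaoui';

// html_entity_decode() turns named and numeric entities into real characters;
// the third argument selects the encoding of the decoded output.
$decoded = html_entity_decode($var, ENT_QUOTES, 'UTF-8');

// Re-escape only the HTML special characters (&, <, >) before printing.
$safe = htmlspecialchars($decoded, ENT_NOQUOTES, 'UTF-8');

echo $safe, PHP_EOL;
```

utf8_decode() converts UTF-8 to ISO-8859-1 and does not touch entities, which would explain why it had no effect here.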

Related

How to change encoding of a web page retrieved by Simple HTML DOM?

I am trying to read the contents of a web page:
$html = file_get_html('http://www.example.com/somepage.aspx');
Since the page's encoding is Windows-1254 and I work on a page encoded as UTF-8, I cannot replace words that contain language-specific characters.
For example, if I try
$str2 = str_replace('TÜRKÇE', 'TURKCE', $str);
it does not replace anything.
I have tried the htmlentities() function. It worked, but it deleted some words that contain special characters.
Work in UTF-8 only. If you have data in other encodings, convert it. If you do not know the encoding, try to detect it; if you cannot, ask your users. Then use only the mb_* functions for all string operations. This is important! Some functions have no mb_* equivalent in native PHP, but you can find hand-made implementations in the user comments on php.net.
After getting the strings I used the iconv('Windows-1254', 'utf-8', $str) function (thanks to @pguardiario). This solved my problem.
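The conversion step can be sketched like this. The input bytes below are a made-up stand-in for what the Windows-1254 page would contain, and the script file itself is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical input: "TÜRKÇE" as Windows-1254 bytes (Ü = 0xDC, Ç = 0xC7).
$str = "T\xDCRK\xC7E";

// Convert to UTF-8 so the bytes match the UTF-8 literals in this script.
$utf8 = iconv('Windows-1254', 'UTF-8', $str);
if ($utf8 === false) {
    throw new RuntimeException('iconv could not convert the input');
}

// The replacement now finds its target.
$str2 = str_replace('TÜRKÇE', 'TURKCE', $utf8);

echo $str2, PHP_EOL;
```

Without the conversion, str_replace() compares the UTF-8 bytes of the search string against single-byte Windows-1254 text, so nothing matches.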

Unicode encoding in php with Hebrew

I am trying to get some information from a webpage, but it is in a different encoding. Is there an easy way to convert it to UTF-8 and then use it?
For example, I am getting these URLs, which I will need to visit:
http://www.mega.co.il/jsfweb/cat/טופו/
http://www.mega.co.il/jsfweb/cat/גבינה_מלוחה/
http://www.mega.co.il/jsfweb/cat/גבינה_לארוח/
http://www.mega.co.il/jsfweb/cat/גבינה_מותכת/
http://www.mega.co.il/jsfweb/cat/גבינה_צהובה/
http://www.mega.co.il/jsfweb/cat/גבינה_לבנה/
http://www.mega.co.il/jsfweb/cat/קוטג/
How do I convert that to UTF-8 and then urlencode it in PHP?
You can try the html_entity_decode() function to decode those entities. To change the encoding, use mb_convert_encoding(). I have no experience with Hebrew, so I don't know if it would work.
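A rough sketch of that idea, assuming the category names arrive as decimal HTML entities (the entity string below is my own encoding of the last segment of the first URL, not data taken from the site):

```php
<?php
// Assumed input: the segment "טופו" written as decimal numeric entities.
$segment = '&#1496;&#1493;&#1508;&#1493;';

// Decode the entities into UTF-8 text.
$name = html_entity_decode($segment, ENT_QUOTES, 'UTF-8');

// Percent-encode only the path segment, not the whole URL.
$url = 'http://www.mega.co.il/jsfweb/cat/' . rawurlencode($name) . '/';

echo $url, PHP_EOL;
```

Encoding just the segment keeps the scheme, host and slashes intact; running urlencode() over the full URL would escape those as well.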

decoding ISO characters

I have Chinese characters stored as entities in ISO-8859-1 text, for example &#20860; = 兼
The characters are read from the database via AJAX and sent as JSON using json_encode.
I then use a Handlebars template to put the data on the page.
When I look at the AJAX page the characters are displayed correctly, although the source is still encoded.
But the final result displays the encoded characters.
I tried to decode in JavaScript with unescape, but the template has no foreach that would let me decode the specific variable, so it crashes.
I tried to decode on the PHP side with htmlspecialchars_decode, but without success.
Both pages are encoded in ISO-8859-1. I can change them to UTF-8 if necessary, but the data in the database remains encoded in ISO-8859-1.
Thank you for your help.
You're simply representing your characters in HTML entities. If you want them as "actual characters", you'll need to use an encoding that can represent those characters, ISO-8859 won't do. htmlspecialchars_decode doesn't work because it only decodes a handful of characters that are special in HTML and leaves other characters alone. You'll need html_entity_decode to decode all entities, and you'll need to provide it with a character set to decode to which can handle Chinese characters, UTF-8 being the obvious best choice:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
You'll then need to make sure the browser knows that you're sending it UTF-8. If you want to store the text in the database in UTF-8 as well (which you really should), best follow the guide How to handle UTF-8 in a web app which explains all the pitfalls.
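To illustrate the difference between the two decode functions, here is a small sketch. It assumes the stored value is the decimal entity &#20860;, and the 'name' key in the JSON payload is hypothetical:

```php
<?php
// Assumed sample: decimal numeric entity for the character 兼 (U+517C).
$str = '&#20860;';

// htmlspecialchars_decode() only handles a handful of special characters
// such as &amp;, &lt;, &gt; and &quot;, so this entity passes through unchanged.
$unchanged = htmlspecialchars_decode($str);

// html_entity_decode() handles all entities when the target charset can hold them.
$decoded = html_entity_decode($str, ENT_COMPAT, 'UTF-8');

// json_encode() needs UTF-8 input; JSON_UNESCAPED_UNICODE keeps the character
// readable instead of emitting \u517c (both forms are valid JSON).
$json = json_encode(['name' => $decoded], JSON_UNESCAPED_UNICODE);

echo $json, PHP_EOL;
```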
Are you including your text with the "double-stache" Handlebars syntax?
{{your expression}}
As the Handlebars documentation mentions, that syntax HTML-escapes its output, which would cause the results you're mentioning, where you're seeing the entity &#20860; instead of 兼.
Using three braces instead ("triple-stache") won't escape the output and will let the browser correctly interpret those numeric entities:
{{{your expression}}}

PHP: how do I convert foreign characters from simple_html_dom to UTF8?

I'm having some trouble with a string that comes from a webpage having foreign characters in it.
The string is generated by parsing the webpage using str_get_html(), followed by $htmldom->innertext; (simple_html_dom class library).
When I output the string using htmlentities() it is displayed fine; but using explode() on the string and printing the parts, I get a tilted block with a question mark in it for each foreign character.
I need to store the string in a utf8 MySQL database, so I need the right foreign characters.
My page has a header with utf8 character set.
I have already tried mb_split() and preg_split(), but those have the same problem.
I solved the issue with:
https://github.com/neitanod/forceutf8
It has a great function that just converts anything to utf-8, no matter what source it's from (as long as it comes in Latin1 (iso 8859-1), Windows-1252 or UTF8 already, or a mix of them).
Many thanks go to Sebastian Grignoli.
PHP and UTF-8 aren't a very good combination. Some functions work fine with UTF-8, others don't, and the worst are those that are documented to work but in fact do not (such as DOMDocument).
You can use mb_convert_encoding() to convert multibyte characters to HTML entities, which usually provides an acceptable workaround:
$string = mb_convert_encoding($string, 'HTML-ENTITIES', 'UTF-8');
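Note that the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2. A sketch of the same workaround with mb_encode_numericentity(), where the sample string and the convmap covering everything above ASCII are my own choices (the script file is assumed to be saved as UTF-8):

```php
<?php
// Sample UTF-8 input, chosen only for illustration.
$string = 'Saïd';

// convmap = [first code point, last code point, offset, mask]:
// encode every code point from U+0080 up and leave ASCII untouched.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Produces decimal numeric entities for the matching characters.
$entities = mb_encode_numericentity($string, $convmap, 'UTF-8');

echo $entities, PHP_EOL;
```

The result is plain ASCII, so it survives functions that are not multibyte-aware, at the cost of storing entities rather than real characters.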

PHP: utf-8 encode, htmlentities giving weird results

I'm trying to get data from a POST form. When the user inputs "habláis", it shows up in view source as just "habláis". I want to convert this to "habl&aacute;is" for purposes of string comparison, but both utf8_encode() and htmlentities() are outputting "habl&Atilde;&iexcl;is", and htmlspecialchars() does nothing. I would use str_replace, but it won't recognize the á when it searches the string.
I'm using a charset of utf-8 consistently across pages. Any idea what's going on?
You are probably not specifying UTF-8 as the character set for the htmlentities() operation.
I'm not sure if this is your problem, but are you calling htmlentities with the UTF-8 parameter? I ask because that's not its default:
"Like htmlspecialchars(), it takes an optional third argument charset which defines character set used in conversion. Presently, the ISO-8859-1 character set is used as the default."
So you might want to try calling your function like this:
$output = htmlentities($input, ENT_COMPAT, 'UTF-8');
Does this solve your problem?
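To make the difference concrete, here is a small sketch comparing the two charsets. The input is the word from the question, and the script file is assumed to be saved as UTF-8. In PHP 5.4 the default changed to UTF-8 (later versions use the default_charset setting, which itself defaults to UTF-8), so the quoted manual text describes older versions:

```php
<?php
$input = 'habláis';

// Treating the UTF-8 bytes as ISO-8859-1 encodes each byte of "á" separately.
$wrong = htmlentities($input, ENT_COMPAT, 'ISO-8859-1');

// Declaring the real charset yields the expected named entity.
$output = htmlentities($input, ENT_COMPAT, 'UTF-8');

echo $wrong, PHP_EOL;   // habl&Atilde;&iexcl;is
echo $output, PHP_EOL;  // habl&aacute;is
```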
