I'm parsing a page using PHP cURL and extracting data with DOMDocument. The parsed page is encoded as UTF-8.
Then I write the data to a database. But instead of
музыка
it is writing ASCII codes like this:
&#1052;&#1091;&#1079;&#1099;&#1082;&#1072;
I've tried iconv(), mb_convert_encoding(), and utf8_encode(), but I still get the same result. strlen() returns the length of the encoded string.
How do I decode this into normal text?
<?php
$string = "&#1052;&#1091;&#1079;&#1099;&#1082;&#1072;";
echo html_entity_decode($string, ENT_NOQUOTES, 'UTF-8');
prints:
Музыка
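The strlen() observation in the question can be checked with a minimal sketch like the one below. It assumes the stored value consists of decimal character references and that the mbstring extension is available; the variable names are only illustrative.

```php
<?php
// Sample input assumed for illustration: "Музыка" written as decimal character references.
$encoded = '&#1052;&#1091;&#1079;&#1099;&#1082;&#1072;';

// Decode the references into real UTF-8 characters.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// strlen() counts bytes, so it sees every "&#1052;" as 7 bytes.
$encodedBytes = strlen($encoded);

// After decoding, each Cyrillic letter takes 2 bytes in UTF-8.
$decodedBytes = strlen($decoded);

// mb_strlen() counts characters instead of bytes.
$decodedChars = mb_strlen($decoded, 'UTF-8');

echo $decoded, "\n";
echo "$encodedBytes $decodedBytes $decodedChars\n";
```

Decoding before the database write keeps the stored value as plain UTF-8 text, provided the connection and the column also use UTF-8.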
I am using this scraper for IMDb, and the problem is that some characters come back as character references such as &#xEF;.
The scraper uses cURL, and the response is a UTF-8 encoded string.
I checked the encoding of the string with mb_detect_encoding(), and it reports UTF-8:
$html = $this->geturl("${imdbUrl}combined");
mb_detect_encoding($html);
So I have a string with some hex values inside, like this for example:
$var = 'Sa&#xEF;d Taghmaoui';
I tried passing $html through utf8_decode(), but no luck: I still have some characters in hex.
So I have a few questions:
1- What's the best solution for this? I can imagine different approaches, for example reading the string and using a regex to replace every hex code with its character, but I am not sure that is the best solution, and I also don't know how to write the regex for it.
2- Can this be solved through cURL? I mean, is there some configuration that sets cURL's encoding to UTF-8, for example?
I have tried the functions recode_string, iconv, and mb_convert_encoding.
Well, basically my problem was that the response from the scraper comes in UTF-8 encoding, but before printing the text I need to process the data with these functions:
$var = 'Sa&#xEF;d Taghmaoui';
htmlspecialchars(html_entity_decode($var, ENT_QUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8');
I receive a JSON response from another script. I then call $json = json_decode($json) and die($json['message']) to show a specific string, and this value contains Cyrillic data.
mb_detect_encoding() reports that the string is UTF-8.
OK, I use charset="utf-8" in the HTML file, but
I see this output "Пользователь с этим адресом электронной почты уже существует" in my browser.
I used mb_convert_encoding($json['message'], 'UTF-8'), without any effect.
Only var_dump($json) shows me the decoded string.
Maybe I am accessing the data in the JSON incorrectly?
Use mb_convert_encoding($json['message'], "utf-8", "windows-1251"); to properly convert the string.
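On the question of whether the data is being accessed correctly: json_decode() returns an object by default, so array-style access only works when the second argument is true. The sketch below uses a made-up payload (the real response is not shown in the question) to illustrate both access styles; the variable names are assumptions.

```php
<?php
// Hypothetical payload standing in for the response from the other script.
// Inside single quotes, PHP leaves the \uXXXX escapes untouched for json_decode().
$json = '{"message":"\u041f\u0440\u0438\u0432\u0435\u0442"}';

// Passing true as the second argument returns associative arrays.
$asArray = json_decode($json, true);
$fromArray = $asArray['message'];

// Without it, json_decode() returns a stdClass object, accessed with ->.
$asObject = json_decode($json);
$fromObject = $asObject->message;

// Both values are already UTF-8; json_decode() resolves the escapes itself.
echo $fromArray, "\n";
echo $fromObject, "\n";
```

If the decoded text still looks wrong in the browser, it is worth checking that the response is actually served with a UTF-8 charset header, since a meta tag alone can be overridden by the server.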
Please can you help me decode this URL so that it displays properly when output with PHP?
This is the link
http://www.megalithic.co.uk/visits.php?op=site&sid=18341&title=Ōyu
I think it's actually coming through as UTF-8 - ie
&title=%C5%8Cyu
$title displays as ÅŒyu
How do I convert this in PHP? I need to use ISO-8859-1 on the page
None of these work
$title=iconv("UTF-8","ISO-8859-1",$title);
$title=iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $title);
$title = utf8_decode($title);
$title = urldecode($title);
Do I need to use the Multibyte MB extension and if so how?
Many thanks in advance
Andy
If that link is to your PHP page, and you get the value via $_GET['title'], then it's already decoded from the URL encoding and $_GET['title'] holds a UTF-8 encoded string with the character Ō. This character cannot be encoded in ISO-8859-1. If ISO-8859-1 is a strict requirement, you'll have to encode the character as an HTML entity in order to express it in a strictly ISO-8859-1 encoded page:
echo htmlentities('Ō', ENT_COMPAT | ENT_HTML5, 'UTF-8');
The character "Ō" does not exist in ISO-8859-1, so it is not possible to convert it from UTF-8 with any of the standard charset conversion functions.
It might, however, be possible to write a function that converts to numeric character references, like &#332; for "Ō".
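One possible shape for such a function is sketched below. It assumes the mbstring extension is loaded, and the helper name to_latin1_with_entities is made up for this example. It turns every code point above U+00FF into a decimal reference and then converts what is left to ISO-8859-1.

```php
<?php
// Hypothetical helper, not part of PHP or of the original answers.
function to_latin1_with_entities(string $utf8): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    // Everything above U+00FF cannot be represented in ISO-8859-1.
    $convmap = [0x100, 0x10FFFF, 0, 0x1FFFFF];

    // Replace those characters with decimal references such as &#332;.
    $withEntities = mb_encode_numericentity($utf8, $convmap, 'UTF-8');

    // The remaining characters all fit into ISO-8859-1.
    return mb_convert_encoding($withEntities, 'ISO-8859-1', 'UTF-8');
}

$title = to_latin1_with_entities('Ōyu');
echo $title, "\n";
```

The sketch does not escape characters such as & or <, so for untrusted input it would make sense to run htmlspecialchars() on the UTF-8 string first.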
My PHP file is in UTF-8 encoding and I am trying to encode my data for safe sending to an application, but some characters get encoded incorrectly.
$text = "Š";
$text = urlencode(utf8_decode($text));
echo $text;
This echoes %3F, but according to the URL-encoding reference at w3schools (http://www.w3schools.com/tags/ref_urlencode.asp), "Š" should be converted into %8A. PHP's own documentation also does not state which reference it uses. Could this be an encoding/decoding issue or something else?
utf8_decode tries to convert from UTF-8 to ISO-8859-1 but Š does not exist in ISO-8859-1. So you obtain '?' (= %3F), the substitution character.
It exists in CP1252 (maybe others), under the hexadecimal code 8A. So:
$text = urlencode(iconv('UTF-8', 'CP1252', $text));
This should give what you expect. In fact, you shouldn't decode a Unicode string.
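To make the difference concrete, here is a small sketch comparing the two percent-encodings of "Š". It assumes the source file is saved as UTF-8 and that the installed iconv knows the CP1252 name, as the line above does; the variable names are illustrative.

```php
<?php
$text = 'Š'; // U+0160, stored in this file as the UTF-8 bytes C5 A0

// Encoding the UTF-8 string directly percent-encodes both of its bytes.
$utf8Encoded = urlencode($text);

// Converting to Windows-1252 first yields the single byte 8A.
$cp1252Encoded = urlencode(iconv('UTF-8', 'CP1252', $text));

echo $utf8Encoded, "\n";
echo $cp1252Encoded, "\n";
```

Which form is right depends on what the receiving application expects; if it accepts UTF-8, sending the first form avoids the lossy conversion altogether.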
I have a string like so "Ba\u015f\u00e7\u0131l". I'm assuming those are some special accent characters. How do I:
1) Display the string with the accents (i.e. replace each code with the actual character)
2) What is best practice for storing strings like this?
3) If I don't want to allow such characters, how do I replace them with "normal characters"?
My educated guess is that you obtained such values from a JSON string. If that's the case, you should properly decode the full piece of data with json_decode():
<?php
header('Content-Type: text/plain; charset=utf-8');
$data = '"Ba\u015f\u00e7\u0131l"';
var_dump( json_decode($data) );
?>
To display the characters, see "How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?"
You can store the string like that, or decoded; just make sure your storage can handle the UTF-8 charset.
Use iconv with the translit flag.
Here's an example...
function replace_unicode_escape_sequence($match) {
return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}
$str = preg_replace_callback('/\\\\u([0-9a-f]{4})/i', 'replace_unicode_escape_sequence', $str);
echo $str;
echo '<br/>';
$str = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
echo $str;
Here's another option:
<html><head>
<!-- don't forget to tell the browser what encoding you're using: -->
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
</head><body><?php
$string = "Ba\u015f\u00e7\u0131l";
echo json_decode('"'.str_replace('"', '\"', $string).'"');
?></body></html>
This works because the \uXXXX syntax is what JSON uses. Note that json_decode() requires the JSON extension, which is now part of the standard PHP installation.
There is no native support in PHP to decode such strings.
There are several tricks that use native functions, though I am not sure that any of them is safe and injection-proof:
json_decode . See http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
xml parser
regex replace
If anybody has other options for escaping/unescaping UTF-8 using native functions, please post a reply.
Another option using Zend Framework is to download the Zend_Utf8 proposal class. See more information at Zend_Utf8 proposal for Zend Framework
Outputting them would output the appropriate character. If you don't declare an encoding for the output document, the browser will try to guess the best one. Otherwise you should determine the encoding and declare it explicitly.
Simply store them as they are, or turn them into normal characters and store them as binary.
Use the iconv functions to convert from one encoding to another; you should then save your source file in the desired encoding to support it.