HTML DOM PARSER UTF-8

I have a website that is not encoded in UTF-8, and I am including one PHP file in another. When I change the encoding to UTF-8, all the characters break, so I can't use a header(...) call or a UTF-8 meta tag.
include_once 'includes/simple_html_dom.php';
$ozet = file_get_contents($url);
$html = str_get_html($ozet);
$trozet = $html->find('div[class="TEST"]',0)->plaintext;
$icerik = "";
$yazi = "<span>$trozet</span>";
$uzunluk = strlen($yazi);
$sinir = 155;
if ($uzunluk > $sinir) {
$icerik = substr($yazi,0,$sinir) . "...";
}
$content.= '<i><span>'.$icerik.'</span></i>';
return $content;
But the resulting HTML looks like this:
Pittsburgh kentinde sakin ve gÃ¼neÅŸli bir sabah, mesai saatinden hemen Ã¶nce insanlar iÅŸlerine doÄŸru koÅŸturmakta, gÃ¼nlÃ¼k telaÅŸlarÄ±nÄ± yaÅŸama...
It should be:
Pittsburgh kentinde sakin ve güneşli bir sabah, mesai saatinden hemen önce insanlar işlerine doğru koşturmakta, günlük telaşlarını...
How can I fix this?

To take a substring of a UTF-8 string, you can use a function like this:
function substrutf8($str,$from,$len){
return preg_replace('#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){0,'. $from .'}'.'((?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){0,'. $len .'}).*#s','$1', $str);}

If you can't use UTF-8, you must convert it to some other encoding:
$yazi = mb_convert_encoding("<span>$trozet</span>", "Windows-1250", "UTF-8");
Note that not every website will be in UTF-8 and Windows-1250 supports only a tiny subset of Unicode characters anyway.
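
As an alternative to the regular expression above, here is a minimal sketch that assumes the mbstring extension is available: it counts and cuts by characters rather than bytes, then converts the result to the encoding the page is served in. The helper names truncate_utf8 and to_site_encoding are hypothetical, and ISO-8859-9 is only an assumed target for a Turkish site; substitute whatever encoding the page actually uses.

```php
<?php
// Hypothetical helper: shorten a UTF-8 string without cutting a character in half.
function truncate_utf8(string $text, int $limit, string $suffix = '...'): string
{
    // mb_strlen/mb_substr count characters; strlen/substr count bytes.
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text;
    }
    return mb_substr($text, 0, $limit, 'UTF-8') . $suffix;
}

// Hypothetical helper: convert UTF-8 text to the encoding the page is served in.
// ISO-8859-9 is an assumption here, not something the question states.
function to_site_encoding(string $utf8, string $siteEncoding = 'ISO-8859-9'): string
{
    return mb_convert_encoding($utf8, $siteEncoding, 'UTF-8');
}

// Truncate first (while the text is still UTF-8), convert afterwards.
$summary = to_site_encoding(truncate_utf8('güneşli bir sabah', 7));
```

Truncating before converting matters: once the text is in a single-byte encoding, the mb_* calls would need that encoding name instead of 'UTF-8'.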

Related

PHP using UTF8 characters in URL, url encoding fails

In my PHP script I try to send UTF-8 characters to the Google Translate website to get a translation of the text back, but this doesn't work for non-ASCII text such as Chinese, Arabic and Russian, and I can't figure out why. If I try to translate 'как дела' to English, I can use this link: https://translate.googleapis.com/translate_a/single?client=gtx&sl=ru&tl=en&dt=t&q=как дела
And it would return this: [[["how are you","как дела",,,1]],,"ru"]
A fine translation, exactly what I wanted. But when I try to recreate it in PHP, I do this (I start from bytes because my future script will use bytes as its starting point):
<?php
$bytes = array(1082,1072,1082,32,1076,1077,1083,1072); // bytes of: как дела
$str = "";
for($i = 0; $i < count($bytes); ++$i) {
$str .= json_decode('"\u' . '0' . strtoupper(dechex($bytes[$i])) . '"'); // returns string: как дела
}
$from = 'ru';
$to = 'en';
$url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=' . $from . '&tl=' . $to . '&dt=t&q=' . $str;
$call = fopen($url,"r");
$contents = fread($call,2048);
print $contents;
?>
And it outputs: [[["RєR RєRґRμR ° \"° F","РєР°РєРґРµР»Р°",,,0]],,"ru"]
The output doesn't make sense; it appears that my PHP script sent the string 'РєР°РєРґРµР»Р°' to be translated to English. I read something about making UTF-8 characters readable for Google in a URI (or URL). It says I should convert my bytes to UTF-8 code units and put them in my URL. I haven't yet figured out how to convert bytes to UTF-8 code units, but I first wanted to check whether it worked. I started by converting my text 'как дела' to code units (percent-encoded for the URL) to test it myself. This resulted in the following link: https://translate.googleapis.com/translate_a/single?client=gtx&sl=ru&tl=en&dt=t&q=%D0%BA%D0%B0%D0%BA+%D0%B4%D0%B5%D0%BB%D0%B0
And when tested in browser it returns: [[["how are you","как дела",,,1]],,"ru"]
Again a fine translation, so it appears to work. I tried to implement it in my script with the following code:
<?php
$from = 'ru';
$to = 'en';
$text = "%D0%BA%D0%B0%D0%BA+%D0%B4%D0%B5%D0%BB%D0%B0"; // code units of: как дела
$url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=' . $from . '&tl=' . $to . '&dt=t&q=' . $text;
$call = fopen($url,"r");
$contents = fread($call,2048);
print $contents;
?>
This script outputs: [[["RєR Rє RґRμR ° \"° F","РєР°Рє РґРµР»Р°",,,0]],,"ru"]
Again my script doesn't output what I want, or what I get when I test these URLs in my own browser. I can't figure out what I'm doing wrong and why Google responds with a jumble of characters when I use the link in my PHP file.
Does someone know how to get the output I want? Thanks in advance!
Updated code to set strings in UTF-8 (not working)
I added a lot of settings at the top of the PHP file to make sure everything is in UTF-8. I also added an mb_convert_encoding call halfway through, but the output is still wrong. The fopen function doesn't send the right UTF-8 string to Google.
Output I get:
URL: https://translate.googleapis.com/translate_a/single?client=gtx&sl=ru&tl=en&dt=t&q=%D0%BA%D0%B0%D0%BA%20%D0%B4%D0%B5%D0%BB%D0%B0
Encoding: ASCII
File contents: [[["RєR Rє RґRμR ° \"° F","РєР°Рє РґРµР»Р°",,,0]],,"ru"]
Code I use:
<?php
header('Content-Type: text/html; charset=utf-8');
$TYPO3_CONF_VARS['BE']['forceCharset'] = 'utf-8';
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
mb_http_input('UTF-8');
mb_language('uni');
mb_regex_encoding('UTF-8');
ob_start('mb_output_handler');
$from = 'ru';
$to = 'en';
$text = rawurlencode('как дела');
$url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=' . $from . '&tl=' . $to . '&dt=t&q=' . $text;
$url = mb_convert_encoding($url, "UTF-8", "ASCII");
$call = fopen($url,"r");
$contents = fread($call,2048);
print 'URL: ' . $url . '<br>';
print 'Encoding: ' . mb_detect_encoding($url) . '<br>';;
print 'File contents: ' . $contents;
?>

Solved! I got a hint from someone outside these forums to look at this Stack Overflow post about setting a user agent. After some more research I found that this answer was the solution to my problem. Now everything works fine!
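
The linked answer is not reproduced here, so the following is only a sketch of what "setting a user agent" could look like with PHP's stream contexts. The helper names build_translate_url and fetch_with_user_agent are hypothetical, the User-Agent value is an arbitrary placeholder, and whether the endpoint still responds this way has not been verified.

```php
<?php
// Hypothetical helper: build the request URL with a percent-encoded query string.
function build_translate_url(string $from, string $to, string $text): string
{
    $query = http_build_query(
        ['client' => 'gtx', 'sl' => $from, 'tl' => $to, 'dt' => 't', 'q' => $text],
        '',
        '&',
        PHP_QUERY_RFC3986 // encodes spaces as %20, like rawurlencode()
    );
    return 'https://translate.googleapis.com/translate_a/single?' . $query;
}

// Hypothetical helper: fetch a URL while sending a browser-like User-Agent.
// The default User-Agent string is a placeholder, not a recommended value.
function fetch_with_user_agent(string $url, string $userAgent = 'Mozilla/5.0')
{
    $context = stream_context_create(['http' => ['user_agent' => $userAgent]]);
    return file_get_contents($url, false, $context); // string, or false on failure
}

// This file must itself be saved as UTF-8 for the literal below to be UTF-8.
$url = build_translate_url('ru', 'en', 'как дела');
// $contents = fetch_with_user_agent($url); // network call, left commented out
```

The 'http' context options also apply to https:// URLs, so the same context works for this endpoint.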

Why encoding doesn't work in function?

I have a function to clean an input (trim it and delete special characters and numbers) in a separate file, and an index file from which I call this function.
// In index.php
$input = format_input($_POST['name']);
// In inc/function.php
function format_input($input){
$pattern = '/[^a-zA-ZÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ\-\'\s]/';
$output = preg_replace($pattern, "", $input);
$output = trim($output);
$output = ucfirst(strtolower($output));
return $output;
}
If I define this function in my index file, the encoding is OK, but if I call it from another file, I get black diamond characters from my regex.
Both files are in UTF-8; I don't understand why it doesn't work!

Try this:
$output = ucfirst(mb_strtolower($output,'utf-8'));
instead of:
$output = ucfirst(strtolower($output));
http://php.net/manual/en/function.mb-strtolower.php
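
A slightly fuller sketch of the same idea, assuming the input is UTF-8 and mbstring is loaded. It differs from the original function in two labeled ways: the pattern uses the /u modifier so PCRE treats pattern and subject as UTF-8 rather than as single bytes, and \p{L} (any Unicode letter) stands in for the explicit list of accented letters. ucfirst() is also byte-based, so the first character is upper-cased with mb_* functions. The name format_name is hypothetical.

```php
<?php
// Hypothetical multibyte-safe variant of format_input().
function format_name(string $input): string
{
    // Keep letters, hyphens, apostrophes and whitespace; drop everything else.
    // preg_replace() returns null on invalid UTF-8, hence the string cast.
    $clean = (string) preg_replace('/[^\p{L}\-\'\s]/u', '', $input);
    $lower = mb_strtolower(trim($clean), 'UTF-8');

    // Upper-case only the first character, counting characters, not bytes.
    $first = mb_strtoupper(mb_substr($lower, 0, 1, 'UTF-8'), 'UTF-8');
    return $first . mb_substr($lower, 1, null, 'UTF-8');
}

echo format_name(' éLODIE42 '), "\n";
```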

Converting SMS encoding to UTF-8 in PHP

I wrote an SMPP Server Transceiver in PHP.
I get this SMS string from my SMPP. It's a UTF-8 message which actually arrives in 7-bit form. Here is a sample message:
5d30205d30205d3
I know how to convert it. It should be:
\x5d3\x020\x5d3\x020\x5d3
I don't want to write it myself. I guess there is already a function that does that for me. Some hidden iconv or using pack() / unpack() to convert this string to the correct format.
I am trying to achieve this using PHP.
Any ideas?
Thanks.

This should do it :
$message = "5d30205d30205d3";
echo "\x".implode("\x", str_split($message, 3));
// \x5d3\x020\x5d3\x020\x5d3

Here is what I eventually used:
public static function sms__from_unicode($message)
{
$org_msg = str_split(strtolower($message), 3);
for($i = 0;$i < count($org_msg); $i++)
$org_msg[$i] = "\u0{$org_msg[$i]}";
$str = implode('', $org_msg);
$str = preg_replace_callback('/\\\\u([0-9a-f]{4})/i', 'replace_unicode_escape_sequence', $str);
return $str;
}
function replace_unicode_escape_sequence($match) {
return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}
Thanks, all.
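
The same conversion can be sketched without the intermediate \u escape strings, assuming (as in the sample) that every character arrives as exactly three hex digits. The function name sms_hex_to_utf8 is hypothetical; each group is left-padded to four hex digits, packed into two bytes, and read as UCS-2BE.

```php
<?php
// Hypothetical helper: decode groups of three hex digits into UTF-8 text.
function sms_hex_to_utf8(string $message): string
{
    $out = '';
    foreach (str_split($message, 3) as $group) {
        // "5d3" -> "05d3" -> two raw bytes 0x05 0xD3 (one UCS-2BE code unit)
        $hex = str_pad($group, 4, '0', STR_PAD_LEFT);
        $out .= mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UCS-2BE');
    }
    return $out;
}

echo sms_hex_to_utf8('5d30205d30205d3'), "\n";
```

For the sample message this yields three Hebrew dalet characters (U+05D3) separated by spaces.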

PHP - ASCII special characters (without MySQL)

I am building a PHP page that accesses a Google account and then shows all emails. I've set a UTF-8 header and meta tag, and I've used a lot of PHP functions to convert the output to UTF-8, but I keep getting strange sequences instead of special characters such as ç, é or ã.
header("Content-Type: text/html; charset=UTF-8");
$message = imap_fetchbody($inbox,$email_number,2);
echo $message;
What should be the output:
çççç
What I get:
=E7=E7=E7=E7

Use imap_qprint (see first comment on that page for an alternative solution).

It seems to be a known issue, according to the first comment on the imap_fetchbody PHP doc page.
Use imap_qprint or use the commenter's solution:
<?php
function ReplaceImap($txt) {
$carimap = array("=C3=A9", "=C3=A8", "=C3=AA", "=C3=AB", "=C3=A7", "=C3=A0", "=20", "=C3=80", "=C3=89");
$carhtml = array("é", "è", "ê", "ë", "ç", "à", " ", "À", "É");
$txt = str_replace($carimap, $carhtml, $txt);
return $txt;
}
$mbox = imap_open("{imap.gmail.com:993/imap/ssl}INBOX", "login", "pass");
$no = 5; // Mail to show (mail number)
$text = imap_fetchbody($mbox, $no, 1);
$text = imap_utf8($text);
$text = ReplaceImap($text);
$text = nl2br($text);
echo $text;
?>
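
The =E7 sequences in the question are quoted-printable, and E7 is ç in ISO-8859-1, not in UTF-8, so decoding alone still leaves non-UTF-8 bytes. A minimal sketch of both steps follows; it uses the core function quoted_printable_decode() so it runs without the IMAP extension. The function name decode_qp_body is hypothetical, and the ISO-8859-1 default is an assumption based on the sample; real code would take the charset from the message part's headers.

```php
<?php
// Hypothetical helper: decode a quoted-printable body and convert it to UTF-8.
// $charset is assumed here; in practice it comes from the MIME part's parameters.
function decode_qp_body(string $body, string $charset = 'ISO-8859-1'): string
{
    $raw = quoted_printable_decode($body);            // "=E7" -> byte 0xE7
    return mb_convert_encoding($raw, 'UTF-8', $charset);
}

echo decode_qp_body('=E7=E7=E7=E7'), "\n";
```

Unlike the lookup-table approach above, this handles any encoded byte rather than a fixed list of characters.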

How to handle user input of invalid UTF-8 characters

I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.
Even though my web application uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode() and overall seems like a bad idea to have around.
W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back.".
How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
How do you present the error in a helpful way to the user?
How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave as-is in the database but doing something (what?) before json_encode()?
I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP?". I'd like advice from people with experience in real-world situations how they've handled this.
As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.

The accept-charset="UTF-8" attribute is only a guideline for browsers to follow, and they are not forced to submit data that way. Crappy form submission bots are a good example...
I usually ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv, you also have the option to transliterate bad characters.
Here is an example using iconv():
$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
If you want to display an error message to your users I'd probably do this in a global way instead of a per value received basis. Something like this would probably do just fine:
function utf8_clean($str)
{
return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}
$clean_GET = array_map('utf8_clean', $_GET);
if (serialize($_GET) != serialize($clean_GET))
{
$_GET = $clean_GET;
$error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}
// $_GET is clean!
You may also want to normalize new lines and strip (non-)visible control chars, like this:
function Clean($string, $control = true)
{
$string = iconv('UTF-8', 'UTF-8//IGNORE', $string);
if ($control === true)
{
return preg_replace('~\p{C}+~u', '', $string);
}
return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}
Code to convert from UTF-8 to Unicode code points:
function Codepoint($char)
{
$result = null;
$codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
if (is_array($codepoint) && array_key_exists(1, $codepoint))
{
$result = sprintf('U+%04X', $codepoint[1]);
}
return $result;
}
echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072
It is probably faster than any other alternative, but I haven't tested it extensively.
Example:
$string = 'hello world�';
// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);
function Bad_Codepoint($string)
{
$result = array();
foreach ((array) $string as $char)
{
$codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
if (is_array($codepoint) && array_key_exists(1, $codepoint))
{
$result[] = sprintf('U+%04X', $codepoint[1]);
}
}
return implode('', $result);
}
This may be what you were looking for.
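
For the specific request in the question (turning invalid sequences into U+FFFD rather than dropping them), one option is mbstring's substitute character. This is a sketch under the assumption that mbstring is loaded; the function name replace_invalid_utf8 is hypothetical. Converting from UTF-8 to UTF-8 leaves valid text untouched and replaces whatever cannot be decoded.

```php
<?php
// Hypothetical helper: replace invalid UTF-8 byte sequences with U+FFFD.
function replace_invalid_utf8(string $input): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character(0xFFFD);         // substitute with U+FFFD
    $clean = mb_convert_encoding($input, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the old setting
    return $clean;
}

$fixed = replace_invalid_utf8("hello\xFFworld");
var_dump(json_encode($fixed) !== false);     // the cleaned string is encodable
```

Restoring the previous substitute character keeps the helper from changing behavior elsewhere in the application.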

Receiving invalid characters from your web application might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:
<form action="..." accept-charset="UTF-8">
You might also want to look at similar questions on Stack Overflow for pointers on how to handle invalid characters, but I think that signaling an error to the user is better than trying to clean up the invalid characters, which can cause unexpected loss of significant data or unexpected changes to your user's input.

I put together a fairly simple class to check if input is in UTF-8 and to run it through utf8_encode() as need be:
class utf8
{
/**
* @param array $data
* @return array
*/
public static function encode(array $data)
{
foreach ($data as $key=>$val) {
if (is_array($val)) {
// Recurse into nested arrays; encode() takes a single argument.
$data[$key] = self::encode($val);
} else {
if (false === self::check($val)) {
$data[$key] = utf8_encode($val);
}
}
}
return $data;
}
/**
* Regular expression to test a string is UTF8 encoded
*
* RFC3629
*
* @param string $string The string to be tested
* @return bool
*
* @link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
*/
public static function check($string)
{
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs',
$string);
}
}
// For example
$data = utf8::encode($_POST);
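
utf8_encode() only converts from ISO-8859-1 and is deprecated as of PHP 8.2, so the same "check, then convert" idea can be sketched with mbstring instead. The function name to_utf8_assuming_latin1 is hypothetical, and treating every non-UTF-8 string as ISO-8859-1 is an assumption carried over from utf8_encode(), not a detection of the real source encoding.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise treat the bytes as
// ISO-8859-1 (the same assumption utf8_encode() makes) and convert.
function to_utf8_assuming_latin1(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

// Apply to every string in a (possibly nested) array such as $_POST.
function array_to_utf8(array $data): array
{
    array_walk_recursive($data, function (&$item) {
        if (is_string($item)) {
            $item = to_utf8_assuming_latin1($item);
        }
    });
    return $data;
}
```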

For completeness to this question (not necessarily the best answer)...
function as_utf8($s) {
return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
}

There is a multibyte extension for PHP. See Multibyte String
You should try the mb_check_encoding() function.
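
A minimal sketch of how mb_check_encoding() could be applied to a whole request, so that an error can name the offending fields instead of silently altering them. The function name find_invalid_utf8 and the field-path format are hypothetical.

```php
<?php
// Hypothetical helper: return the names of all fields that are not valid UTF-8.
function find_invalid_utf8(array $data, string $path = ''): array
{
    $bad = [];
    foreach ($data as $key => $value) {
        $name = $path === '' ? (string) $key : $path . '[' . $key . ']';
        if (is_array($value)) {
            // Recurse into nested form arrays such as tags[].
            $bad = array_merge($bad, find_invalid_utf8($value, $name));
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[] = $name;
        }
    }
    return $bad;
}

$invalid = find_invalid_utf8(['name' => 'ok', 'tags' => ['a', "\xC3\x28"]]);
// $invalid now lists the fields to report back to the user.
```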

I recommend merely not allowing garbage to get in. Don't rely on custom functions, which can bog your system down.
Simply walk the submitted data against an alphabet you design. Create an acceptable alphabet string and walk the submitted data, byte by byte, as if it were an array. Push acceptable characters to a new string, and omit unacceptable characters.
The data you store in your database then is data triggered by the user, but not actually user-supplied data.
<?php
// Build alphabet
// Optionally, you can remove characters from this array
$alpha[] = chr(0); // null
$alpha[] = chr(9); // tab
$alpha[] = chr(10); // new line
$alpha[] = chr(11); // vertical tab
$alpha[] = chr(13); // carriage return
for ($i = 32; $i <= 126; $i++) {
$alpha[] = chr($i);
}
/* Remove comment to check ASCII ordinals */
// /*
// foreach ($alpha as $key => $val) {
// print ord($val);
// print '<br/>';
// }
// print '<hr/>';
//*/
//
// // Test case #1
//
// $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv ' . chr(160) . chr(127) . chr(126);
//
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
//
// // Test case #2
//
// $str = '' . '©?™???';
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
//
// $str = '©';
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
$file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
$testfile = implode(chr(10), file($file));
$string = teststr($alpha, $testfile);
print $string;
print '<hr/>';
function teststr(&$alpha, &$str) {
$strlen = strlen($str);
$newstr = chr(0); // null
$x = 0;
if($strlen >= 2) {
for ($i = 0; $i < $strlen; $i++) {
$x++;
if(in_array($str[$i], $alpha)) {
// Passed
$newstr .= $str[$i];
}
else {
// Failed
print 'Found out of scope character. (ASCII: ' . ord($str[$i]). ')';
print '<br/>';
$newstr .= '�';
}
}
}
elseif($strlen <= 0) {
// Failed to qualify for test
print 'Non-existent.';
}
elseif($strlen === 1) {
$x++;
if(in_array($str, $alpha)) {
// Passed
$newstr = $str;
}
else {
// Failed
print 'Total character failed to qualify.';
$newstr = '�';
}
}
else {
print 'Non-existent (scope).';
}
if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
// Skip
}
else {
$newstr = utf8_encode($newstr);
}
// Test encoding:
if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
print 'UTF-8 :D<br/>';
}
else {
print 'ENCODED: ' . mb_detect_encoding($newstr, "UTF-8") . '<br/>';
}
return $newstr . ' (scope: ' . $x . ', ' . $strlen . ')';
}

Strip all characters outside your given subset. At least in some parts of my application I would not allow using characters outside the [a-zA-Z] and [0-9] sets, for example in usernames.
You can build a filter function that silently strips all characters outside this range, or that returns an error if it detects them and pushes the decision to the user.
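
Both variants can be sketched in a few lines. The function names and the 32-character limit are assumptions for illustration only.

```php
<?php
// Hypothetical validator: accept only ASCII letters and digits, 1 to 32 long.
function is_valid_username(string $name): bool
{
    // \A and \z anchor to the very start and end of the string.
    return preg_match('/\A[a-zA-Z0-9]{1,32}\z/', $name) === 1;
}

// Hypothetical filter: silently drop every byte outside the allowed set.
function strip_username(string $name): string
{
    return (string) preg_replace('/[^a-zA-Z0-9]/', '', $name);
}
```

Without the /u modifier the pattern works byte by byte, so invalid UTF-8 is simply removed by strip_username() and rejected by is_valid_username().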

Try doing what Ruby on Rails does to force all browsers always to post UTF-8 data:
<form accept-charset="UTF-8" action="#{action}" method="post"><div
style="margin:0;padding:0;display:inline">
<input name="utf8" type="hidden" value="✓" />
</div>
<!-- form fields -->
</form>
See railssnowman.info or the initial patch for an explanation.
To have the browser send form-submission data in the UTF-8 encoding, just render the page with a Content-Type header of "text/html; charset=utf-8" (or use a meta http-equiv tag).
To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), use accept-charset="UTF-8" in the form.
To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), and even if the browser is Internet Explorer and the user switched the page encoding to Korean and entered Korean characters in the form fields, add a hidden input to the form with a value such as ✓ which can only be from the Unicode charset (and, in this example, not the Korean charset).
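
On the PHP side, the hidden field can double as a cheap sanity check. This is only a sketch: the function name submitted_as_utf8 is hypothetical, and it assumes the form contains the hidden utf8 input shown above. If the check mark does not come back as the UTF-8 bytes for U+2713, the submission was probably sent in another encoding.

```php
<?php
// Hypothetical check: the hidden "utf8" field should come back as U+2713.
function submitted_as_utf8(array $post): bool
{
    return isset($post['utf8']) && $post['utf8'] === "\u{2713}";
}

// Example: if (!submitted_as_utf8($_POST)) { /* show an encoding error */ }
```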

Set UTF-8 as the character set for all headers output by your PHP code.
In every PHP output header, specify UTF-8 as the encoding:
header('Content-Type: text/html; charset=utf-8');
