I want to extract various data from URLs and have it converted to UTF-8 no matter what encoding is used in the original page (or at least have it work for most source encodings).
So, after searching through many discussions and answers, I finally came up with the following code, in which I parse the HTML data twice (once to detect the encoding and a second time to get the actual data). It works on all of the URLs I have checked, but I think the code is poorly written.
Can anyone let me know whether there is a better way to do this, or what improvements the code needs?
<?php
header('Content-Type: text/html; charset=utf-8');

require_once 'curl.php';
require_once 'curl_response.php';

$curl = new Curl;
$url = "http://" . $_GET['domain'];
$curl_response = $curl->get($url);
$header_content_type = $curl_response->headers['Content-Type'];

$dom_doc = new DOMDocument();
libxml_use_internal_errors(TRUE);
$dom_doc->loadHTML('<?xml encoding="utf-8" ?>' . $curl_response);
libxml_use_internal_errors(FALSE);

$metas = $dom_doc->getElementsByTagName('meta');
foreach ($metas as $meta) {
    if (strtolower($meta->getAttribute('http-equiv')) == 'content-type') {
        $meta_content_type = $meta->getAttribute('content');
    }
    if ($meta->getAttribute('charset') != '') {
        $html5_charset = $meta->getAttribute('charset');
    }
}

if (preg_match('/charset=(.+)/', $header_content_type, $m)) {
    $charset = $m[1];
} elseif (preg_match('/charset=(.+)/', $meta_content_type, $m)) {
    $charset = $m[1];
} elseif (!empty($html5_charset)) {
    $charset = $html5_charset;
} elseif (preg_match('/encoding=(.+)/', $curl_response, $m)) {
    $charset = $m[1];
} else {
    // browser default charset
    // $charset = 'ISO-8859-1';
}

if (!empty($charset) && $charset != "utf-8") {
    $tmp = iconv($charset, 'utf-8', $curl_response);
    libxml_use_internal_errors(TRUE);
    $dom_doc->loadHTML('<?xml encoding="utf-8" ?>' . $tmp);
    libxml_use_internal_errors(FALSE);
}

$page_title = $dom_doc->getElementsByTagName('title')->item(0)->nodeValue;

$metas = $dom_doc->getElementsByTagName('meta');
foreach ($metas as $meta) {
    if (strtolower($meta->getAttribute('name')) == 'description') {
        $meta_description = $meta->getAttribute('content');
    }
    if (strtolower($meta->getAttribute('name')) == 'keywords') {
        $meta_tags = $meta->getAttribute('content');
    }
}

print $charset;
print "<hr>";
print $page_title;
print "<hr>";
print $meta_description;
print "<hr>";
print $meta_tags;
print "<hr>";
print "Memory Peak Usages: " . memory_get_peak_usage()/1024/1024 . " MB";
?>
Your question is too open-ended, and I've voted to close it. However, I will still provide a stub of an answer that will, hopefully, point you in the right direction.
At the moment, you are checking user-defined input for the charset. This is a very, very, very bad move, for various reasons:
Most webmasters on small sites just send header("Content-type: text/html; charset=utf-8") because they've heard it is good practice, without actually encoding their output as UTF-8. Not taking this into account will lead to mangled UTF-8 output.
Some webmasters do the opposite: they do not set a header, and their webserver sends ISO-8859-1 headers despite the page being UTF-8 encoded. On a rendered page this is not visible, but it matters for DOMDocument (I ran into this issue recently).
Double UTF-8 encoding via iconv is never fun.
I'd strongly advise using a utility to decode UTF-8 until there are no more characters within the UTF-8 extended range, and then encoding once, rather than relying on iconv or multibyte encoding. The reason is simple: these can get it wrong. You can also set an error handler to parse DOMDocument errors in order to catch and redirect the loadXML "failed due to malformed XML" errors, which are not related to your character encoding at all. Basically, the key to your problem is to not blindly do stuff.
If you'd like a good target for worrying about UTF-8, parse the home page of Google Play. They send out malformed replies (which is what initially forced me to go through the UTF-8-decode-until-nothing-is-in-the-range approach). It will also show you that DOMDocument can fail for a wide variety of reasons - not just charset - and that you need to follow the errors to deal with them.
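As a rough illustration of "following the errors" (a minimal sketch, not the exact utility referred to above; it assumes $html already holds the fetched markup):

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach (libxml_get_errors() as $error) {
    // $error->message and $error->line let you tell encoding complaints apart
    // from plain malformed-markup noise instead of silently discarding them
    error_log(trim($error->message) . ' on line ' . $error->line);
}
libxml_clear_errors();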
Other performance pointers outside of that big encoding snafu include:
Splitting your code into functions that return their results. You've got a lot of repetition in there - learn to use functions so that you stop having to explicitly write the same core logic multiple times.
This:
if (preg_match('/charset=(.+)/', $header_content_type, $m)) {
$charset = $m[1];
} elseif (preg_match('/charset=(.+)/', $meta_content_type, $m)) {
is horrible. You can easily guard it with a strpos() check so that the regex only runs when a charset is actually present, which will speed this particular set of ifs up by about 5-10x.
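A minimal sketch of that idea (the helper name is just for illustration):

function extract_charset($header_or_meta) {
    // cheap pre-check: skip the regex entirely when "charset=" is absent
    if ($header_or_meta === null || stripos($header_or_meta, 'charset=') === false) {
        return null;
    }
    return preg_match('/charset=([^\s;]+)/i', $header_or_meta, $m) ? strtolower($m[1]) : null;
}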
$metas = $dom_doc->getElementsByTagName('meta'); - you're aware that DOMDocument will walk through your entire DOM when you use this method, right? Consider restricting the lookup to just the head element (which is always the first child of html) with an XPath query such as /html/head/meta.
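For example (a sketch reusing the $dom_doc from the question):

$xpath = new DOMXPath($dom_doc);
foreach ($xpath->query('/html/head/meta') as $meta) {
    // inspect the http-equiv / charset / name attributes exactly as before,
    // but without touching every <meta> in the body
}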
In regard to performance, you should call unset() on variables as soon as you're done with them, even if you're going to assign them new values later, but not if you still need the value further down your script. PHP doesn't return that memory to the operating system, but the memory freed by unset() is reused for future allocations within the script.
Another thing you could do is take the big chunks of that code and split them into functions that return their results. Remember that a function's local variables and their memory are automatically released when it returns, unless you're working with global variables.
Those will help performance and memory utilization.
I have a PHP script as below:
10. $json_sanitized = ds($json);
11. echo json_encode($json_sanitized);
The ds() function has a few rules to sanitize the $json data.
function ds($text, $double = true, $charset = null) {
    if (is_array($text)) {
        // Some code
    } elseif (is_object($text)) {
        // Some code
    } elseif (is_bool($text)) {
        // Some code
    }

    $defaultCharset = 'UTF-8';
    if (is_string($double)) {
        $charset = $double;
    }

    return htmlspecialchars($text, ENT_QUOTES, ($charset) ? $charset : $defaultCharset, $double);
}
But the HP Fortify scanner still reports that line #11 sends unvalidated data to a web browser, which can result in the browser executing malicious code.
Can anyone help with this?
Per a few other answers on this site, the json_encode function in PHP is generally safe, and there are some options that can help make it safer through additional escaping.
Using the following helps to escape more of the potentially unsafe characters that Fortify picks up:
echo json_encode($json_sanitized, JSON_HEX_QUOT | JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS);
Per the json_encode docs and the json constants docs, these constants provide the following (optional) conversions:
JSON_HEX_QUOT - all " characters are converted to \u0022.
JSON_HEX_TAG - all < and > characters are converted to \u003C and \u003E.
JSON_HEX_AMP - all & characters are converted to \u0026.
JSON_HEX_APOS - all ' characters are converted to \u0027.
You may be able to skip escaping the single and double quotes, as I imagine the biggest gripe is that < and > can be printed unescaped.
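A quick illustration of what those flags change (the output shown in the comments is approximate):

$data = array('msg' => "<script>alert('x')</script> & \"quotes\"");

echo json_encode($data);
// {"msg":"<script>alert('x')<\/script> & \"quotes\""}

echo json_encode($data, JSON_HEX_QUOT | JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS);
// {"msg":"\u003Cscript\u003Ealert(\u0027x\u0027)\u003C\/script\u003E \u0026 \u0022quotes\u0022"}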
Json: PHP to JavaScript safe or not?
Is json_encode Sufficient XSS Protection?
I have an XML string. That XML string has to be converted into a PHP array in order to be processed by other parts of the software my team is working on.
For the XML -> array conversion I'm using something like this:
if (get_class($xmlString) != 'SimpleXMLElement') {
    $xml = simplexml_load_string($xmlString);
}
if (!$xml) {
    return false;
}
It works fine - most of the time :) The problem arises when my "xmlString" contains something like this:
<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>
Then simplexml_load_string won't do its job (and I know that's because of the "<" character).
As I can't influence any other part of the code (I can't open up the module that's generating the XML string and tell it "encode special characters, please!"), I need your suggestions on how to fix that problem BEFORE calling simplexml_load_string.
Do you have some ideas? I've tried
str_replace("<", "&lt;", $xmlString)
but that simply ruins the entire "xmlString"... :(
Well, then you can just replace the special characters inside the attribute values of $xmlString with their HTML entity counterparts, using htmlspecialchars() and preg_replace_callback().
I know this is not performance friendly, but it does the job :)
<?php
$xmlString = '<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>';
$xmlString = preg_replace_callback('~(?:").*?(?:")~',
function ($matches) {
return htmlspecialchars($matches[0], ENT_NOQUOTES);
},
$xmlString
);
header('Content-Type: text/plain');
echo $xmlString; // you will see the special characters are converted to HTML entities :)
echo PHP_EOL . PHP_EOL; // tidy :)
$xmlobj = simplexml_load_string($xmlString);
var_dump($xmlobj);
?>
The standard way of sanitizing input would be to use commands such as
$url = preg_replace('|[^a-z0-9-~+_.?#=!&;,/:%#$\|*\'()\\x80-\\xff]|i', '', $url);
$strip = array('%0d', '%0a', '%0D', '%0A');
preg_replace("/[^A-Za-z0-9 ]/", '', $string);
echo htmlentities($str);
However, I like it when my users are able to use nice things like parentheses, carets, quotes, etc. in their input: comments, usernames, and so on. Since HTML renders entity codes such as &#40; as symbols such as (, I was hoping to use this alternative approach to sanitizing their input.
Before I embarked on writing a function to do this for possibly harmful characters such as ( or ; or < (so that injections such as a sneaky eval() or <text/javascript> would not work), I tried searching for previous attempts at this type of sanitization.
I found none.
This makes me think that I must be clearly overlooking some incredibly obvious security flaw in my "creative" sanitization method.
I will not be using this function as the primary way to protect my MySQL database; I have the new mysqli class for that. Adding this sanitization on top of mysqli's separation of input and query seems like a nice idea, though.
I am using a completely different function to clean up URLs. Those require a different approach.
This function will be used for user input to be displayed on the page, though.
So .... what could I possibly be missing? I KNOW there's gotta be something wrong with this idea since no one else uses it, right?! Is it possible to "re-render the rendered text" or something else horrific and obvious? My pretty little function so far:
Takes input strings like meep';) drop table or
alert(eval('document.body.inne' + 'rHTML'));
function santitize_data($data) {
    // Explode the string and do a replacement for each character separately, doing only one
    // replacement per character. Don't do it with preg_replace, because that function searches
    // through the string in multiple passes and replaces already-replaced characters,
    // resulting in a horrific mishmash. Put it back together at the end.
    $patterns = array();
    $patterns[0]  = "'";
    $patterns[1]  = '"';
    $patterns[2]  = '!';
    $patterns[3]  = '\\';
    $patterns[4]  = '#';
    $patterns[5]  = '%';
    $patterns[6]  = '&';
    $patterns[7]  = '$';
    $patterns[8]  = '(';
    $patterns[9]  = ')';
    $patterns[10] = '/';
    $patterns[11] = ':';
    $patterns[12] = ';';
    $patterns[13] = '|';
    $patterns[14] = '<';
    $patterns[15] = '>';
    $patterns[16] = '{';
    $patterns[17] = '}';

    $replacements = array();
    $replacements[0]  = '&#39;';
    $replacements[1]  = '&#34;';
    $replacements[2]  = '&#33;';
    $replacements[3]  = '&#92;';
    $replacements[4]  = '&#35;';
    $replacements[5]  = '&#37;';
    $replacements[6]  = '&#38;';
    $replacements[7]  = '&#36;';
    $replacements[8]  = '&#40;';
    $replacements[9]  = '&#41;';
    $replacements[10] = '&#47;';
    $replacements[11] = '&#58;';
    $replacements[12] = '&#59;';
    $replacements[13] = '&#124;';
    $replacements[14] = '&#60;';
    $replacements[15] = '&#62;';
    $replacements[16] = '&#123;';
    $replacements[17] = '&#125;';

    $split_data = str_split($data);
    foreach ($split_data as &$value) {
        for ($i = 0; $i < 17; $i++) {
            //testing
            //echo '<br> i='.$i.' value='.$value.' patterns[i]='.$patterns[$i].' replacements[i]='.$replacements[$i].'<br>';
            if ($value == $patterns[$i]) {
                $value = $replacements[$i];
                $i = 17; // stop scanning once a replacement has been made
            }
        }
    }
    unset($value); // break the reference with the last element

    $data = implode($split_data);

    // A bit of commented-out code... was using what seemed more logical before (preg_replace),
    // but it parses the string in multiple passes :(
    //$data = preg_replace($patterns, $replacements, $data);

    return $data;
} //---END function definition of santitize_data
Spits out result strings like meep&#39;&#59;&#41; drop table or
alert&#40;eval&#40;&#39;document.body.inne&#39; + &#39;rHTML&#39;&#41;&#41;&#59;
and the user sees these things rendered in the browser like meep';) drop table and
alert(eval('document.body.inne' + 'rHTML'));
Without analyzing your code I can tell you that there is a high probability that you've overlooked something that an attacker could use to inject their own code.
The main threat here is XSS - you shouldn't need to "sanitize" to insert data into a database. You either use parameterised queries, or you correctly encode characters that the database query language confers special meaning to at the point of entry into your database (e.g. the ' character). XSS is normally dealt with by encoding at the point of output; however, if you want to allow rich text then you need to take a different approach, which is what I believe you are looking to achieve here.
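For reference, a minimal sketch of the parameterised-query option ($mysqli is assumed to be an existing mysqli connection, and the table and column names are just placeholders):

$stmt = $mysqli->prepare('SELECT id, body FROM comments WHERE username = ?');
$stmt->bind_param('s', $username); // the value is sent separately from the query text
$stmt->execute();
$result = $stmt->get_result();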
Remember there is no magic function that sanitizes input in a generic manner - it very much depends on how and where it is used to determine whether it is safe or not in that context. (This bit added so if anyone searches and finds this answer then they are up to speed - I think you're already on top of this though.)
Complexity is the main enemy of security. If you cannot determine whether your code is safe or not it is too complicated and a sufficiently motivated attacker with enough time will find a way round your sanitization methods.
What can you do about this?
If you want to allow your users to enter rich text, you could either support BBCode, letting users insert a limited, safe subset of HTML via your own conversion functions, or you could allow HTML entry and run the content through a tried and tested solution such as HTML Purifier. Now, HTML Purifier won't be perfect, and I'm sure (another) flaw will be found in it at some point in the future.
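Typical usage looks roughly like this (a sketch assuming the HTML Purifier library is installed and autoloaded):

$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html); // $dirty_html is the user-supplied markup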
How to guard against this?
If you implement a Content Security Policy on your site, it will prevent any successfully injected script code from executing in the browser (check current browser support for CSP before relying on it). Don't be tempted to use just one of these methods - a good security model has layered defences, so if one control is circumvented, another can catch it.
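For example, a policy can be sent straight from PHP; the directives below are only illustrative placeholders, so tailor them to your own site:

header("Content-Security-Policy: default-src 'self'; script-src 'self'; object-src 'none'");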
Google have now implemented CSP in Gmail to ensure any HTML email received cannot try anything sneaky to launch an XSS attack.
I am using curl to get a JSON file, which can be found here (it's way too long to copy-paste): http://www.opap.gr/web/services/rs/betting/availableBetGames/sport/program/4100/0/sport-1.json?localeId=el_GR
After that I use json_decode to get the associative array. Up to that point everything seems OK: when I var_dump it, the characters inside the array are in Greek. After that I use the following code:
$JsonClass = new ArrayToXML();
$mydata = $JsonClass->toXml($json);

class ArrayToXML
{
    public static function toXML( $data, $rootNodeName = 'ResultSet', &$xml = null ) {
        // turn off compatibility mode as simple xml throws a wobbly if you don't.
        // if ( ini_get('zend.ze1_compatibility_mode') == 1 ) ini_set( 'zend.ze1_compatibility_mode', 0 );
        if ( is_null( $xml ) ) //$xml = simplexml_load_string( "" );
            $xml = simplexml_load_string("<?xml version='1.0' encoding='UTF-8'?><$rootNodeName />");

        // loop through the data passed in.
        foreach ( $data as $key => $value ) {
            $numeric = false;

            // no numeric keys in our xml please!
            if ( is_numeric( $key ) ) {
                $numeric = 1;
                $key = $rootNodeName;
            }

            // delete any char not allowed in XML element names
            $key = preg_replace('/[^a-z0-9\-\_\.\:]/i', '', $key);

            // if there is another array found, recursively call this function
            if ( is_array( $value ) ) {
                $node = ArrayToXML::isAssoc( $value ) || $numeric ? $xml->addChild( $key ) : $xml;

                // recursive call.
                if ( $numeric ) $key = 'anon';
                ArrayToXML::toXml( $value, $key, $node );
            } else {
                // add single node.
                $value = htmlentities( $value );
                $xml->addChild( $key, $value );
            }
        }

        // pass back as XML
        return $xml->asXML();
    }

    public static function isAssoc( $array ) {
        return (is_array($array) && 0 !== count(array_diff_key($array, array_keys(array_keys($array)))));
    }
}
And here comes the problem. All the Greek characters inside the result come out as strange characters, Î?Î?Î¥Î?Î?ΡΩΣÎ?Î? for example. I really don't know what I am doing wrong; I am really bad with encoding/decoding things :(.
And to make this a bit more clear:
Here is what the associative array (one of the parts I have the problem with) looks like:
{ ["resources"]=> array(4) { ["team-4833"]=> string(24) "ΛΕΥΚΟΡΩΣΙΑ U21" ["t-429"]=> string(72) "ΠΡΟΚΡΙΜΑΤΙΚΑ ΕΥΡΩΠΑΪΚΟΥ ΠΡΩΤΑΘΛΗΜΑΤΟΣ" ["t-429-short"]=> string(6) "ΠΕΠ" ["team-15387"]=> string(16) "ΕΛΛΑΔΑ U21" } ["locale"]=> string(5) "el_GR" } ["relatedNum"]=> NULL }
And here is what I get after using SimpleXML:
<resources><team-4833>Î?Î?Î¥Î?Î?ΡΩΣÎ?Î? U21</team-4833><t-429>ΠΡÎ?Î?ΡÎ?Î?Î?ΤÎ?Î?Î? Î?ΥΡΩΠÎ?ΪÎ?Î?Î¥ ΠΡΩΤÎ?Î?Î?Î?Î?Î?ΤÎ?Σ</t-429><t-429-short>Î Î?Î </t-429-short><team-15387>Î?Î?Î?Î?Î?Î? U21</team-15387></resources><locale>el_GR</locale></lexicon><relatedNum></relatedNum></betGames>
Thanks in advance for your replies.
PS: I also have <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> in the page where I display the result, but it doesn't help.
I still didn't find a solution for this, so I used a different approach, something like Yannis suggested. I saved the XML to a file using the class I found here: http://www.phpclasses.org/package/1826-PHP-Store-associative-array-data-on-file-in-XML.html
After that I load the XML with simplexml_load_file and use XSLT to access the data in all the nodes and store it in my database. It worked fine that way. If anyone still wants to try and explain to me why it doesn't work the way I tried to do it at the start, feel free (just for the learning purpose :p). Thanks for your replies :).
There is no need - the same data is apparently available in XML format as well, here:
http://www.opap.gr/web/services/rs/betting/availableBetGames/sport/program/4100/0/sport-1.xml?localeId=el_GR
Just had to play with the url parameters a bit :)
This worked for me in Chrome using PHP version 5.3.6:
$json = file_get_contents('http://www.opap.gr/web/services/rs/betting/availableBetGames/sport/program/4100/0/sport-1.json?localeId=el_GR');
$json = json_decode($json, true);
$xml = new SimpleXMLElement('<ResultSet/>');
array_walk_recursive($json, array ($xml, 'addChild'));
print $xml->asXML();
exit();
Clearly your bug is that you are manipulating UTF-8-encoded Unicode as though those bytes were ISO-8859-1.
I cannot see where this is happening; probably in your call to htmlentities, whatever that is.
It may need to use some sort of “multibyte” hack, perhaps including such things as this sort of pattern:
/([^\x00-\x7F])/u
with an explicit /u so it works on logical code points instead of 8-bit code units (read: bytes). It might do this to grab one non-ASCII code point at a time so it can replace it with a numeric entity. Without the easily forgotten /u, it would work on bytes, not code points, which matches what your description shows happening.
It could be this sort of thing, or it might be that you have to swap over to some of the mb_*() functions instead of the normal ones. This is to work around the fundamental underlying PHP problem that there is no real Unicode support in the language, just a few band-aids here and there that seem to fall off from time to time for no good reason.
If you could use a clean language with not just proper Unicode support but also a clear separation between physical bytes and abstract characters, this sort of thing would not be happening. But I bet it’s a common problem that others must be having too, so I would be really surprised if it were a library bug instead of a (perfectly understandable!) oversight somewhere in your code.
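To make that concrete, here is a small sketch of the "numeric entity per non-ASCII code point" idea mentioned above (the helper name is made up, and mb_ord() requires PHP 7.2 or later):

function encode_non_ascii($text) {
    return preg_replace_callback('/([^\x00-\x7F])/u', function ($m) {
        // one non-ASCII code point at a time -> numeric character reference
        return '&#' . mb_ord($m[1], 'UTF-8') . ';';
    }, $text);
}

echo encode_non_ascii("ΕΛΛΑΔΑ U21"); // &#917;&#923;&#923;&#913;&#916;&#913; U21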
An answer to your question from Greece: the word "ΛΕΥΚΟ" has the (single-byte Greek) character codes 203-197-213-202-207. When it is read back, however, a 206 is added in front of each letter and the letters are doubled, and the codes change as follows: 206-(203-48=155), 206-(197-48=149), 206-(213-48=165), 206-(202-48=154), 206-(207-48=159).
Consequently, the solution is to check each character: if you find a 206, ignore it, add 48 to the code of the next character, and that gives you the new character.
Because I am also working on decoding the OPAP feeds, any new knowledge is welcome - mail: bluegt03#in.gr
When building XML in PHP, is it quicker to build a string and then echo it out, or to use the XML functions that PHP gives you? Currently I'm doing the following:
UPDATED with a better code snippet:
$searchParam = mysql_real_escape_string($_POST['s']);
$search = new Search($searchParam);

if ($search->retResult() > 0) {
    $xmlRes = $search->buildXML();
} else {
    $xmlRes = '<status>no results</status>';
}

$xml = "<?xml version=\"1.0\"?>";
$xml .= "<results>";
$xml .= $xmlRes;
$xml .= "</results>";

header("content-type: text/xml");
header("content-length: " . strlen($xml));
echo($xml);
class Search {

    private $num;
    private $q;

    function __construct($s) {
        $this->q = mysql_query('select * from foo_table where name = "' . $s . '"');
        $this->num = mysql_num_rows($this->q);
    }

    function retResult() {
        return $this->num;
    }

    function buildXML() {
        $xml = '<status>success</status>';
        $xml .= '<items>';
        while ($row = mysql_fetch_object($this->q)) {
            $xml .= '<item>';
            $desTag = '<info><![CDATA[';
            foreach ($row as $key => $current) {
                if ($key == 'fob') {
                    //do something with current
                    $b = mysql_query('select blah from dddd where id =' . $current);
                    $a = mysql_fetch_array($b);
                    $xml .= '<' . $key . '>' . $a['blah'] . '</' . $key . '>';
                } else if ($key == 'this' || $key == 'that') {
                    $desTag .= ' ' . $current;
                } else {
                    $xml .= '<' . $key . '>' . $current . '</' . $key . '>';
                }
            }
            $desTag .= ']]></info>';
            $xml .= $desTag;
            $xml .= '</item>';
        }
        $xml .= '</items>';
        return $xml;
    }
}
Is there a faster way of building the XML? Once I get to about 2000 items it starts to slow down.
Thanks in advance!
Use the XML functions. Remember that when you concatenate a string, you may have to reallocate the WHOLE STRING on every concatenation.
For small strings, concatenation is probably faster, but in your case definitely use the XML functions.
I notice that you're making no attempt to escape the text before concatenating it, which means that sooner or later you're going to generate something that is almost-but-not-quite XML and will be rejected by any conforming parser.
Use a library (XMLWriter is probably more performant than others, but I haven't done XML with PHP).
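A minimal XMLWriter sketch, for what it's worth (it streams the document instead of concatenating, and escapes text content for you):

$w = new XMLWriter();
$w->openMemory();
$w->startDocument('1.0', 'UTF-8');
$w->startElement('results');
$w->writeElement('status', 'success'); // writeElement() escapes the text for you
$w->endElement();
$w->endDocument();

header('Content-Type: text/xml');
echo $w->outputMemory();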
You have a SQL query inside of a loop, which is usually quite a bad idea. Even if each query takes half a millisecond to complete, it's still a whole second just to execute those 2000 queries.
What you need to do is post the two queries in a new question so that someone can show you how to turn them into a single query using a JOIN.
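For illustration only, the combined query could look roughly like this; it reuses the table and column names from the snippet above (foo_table, dddd, fob, blah), so adjust it to the real schema:

$sql = 'SELECT f.*, d.blah
          FROM foo_table AS f
     LEFT JOIN dddd AS d ON d.id = f.fob
         WHERE f.name = "' . mysql_real_escape_string($s) . '"';
$result = mysql_query($sql); // one round-trip instead of one query per row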
Database stuff usually largely outweighs any kind of micro-optimization. Whether you use string concatenation or XMLWriter doesn't matter when you're executing several thousand queries.
Try echoing on each iteration (put the echo $xml before the while loop ends, and reset $xml at the beginning); it should be quicker.
That code snippet doesn't make a lot of sense, please post some actual code, reduced for readability.
A faster version of the code you posted would be
$xml = '';
while ($row = mysql_fetch_row($result))
{
    $xml .= '<items><test>' . implode('</test><test>', $row) . '</test></items>';
}
In general, using mysql_fetch_object() is slightly slower than the other options.
Perhaps what you were trying to do was something like this:
$xml = '<items>';
while ($row = mysql_fetch_assoc($result))
{
    $xml .= '<item>';
    foreach ($row as $k => $v)
    {
        $xml .= '<' . $k . '>' . htmlspecialchars($v) . '</' . $k . '>';
    }
    $xml .= '</item>';
}
$xml .= '</items>';
As mentioned elsewhere, you have to escape the values unless you're 100% sure there will never be any special characters such as "<", ">" or "&". This actually applies to $k as well. In that kind of script, it is also generally more performant to use XML attributes instead of nodes.
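A rough sketch of that attribute-based variant (the attribute names are just the column keys from the row, and escaping still applies):

$xml .= '<item';
foreach ($row as $k => $v) {
    $xml .= ' ' . $k . '="' . htmlspecialchars($v, ENT_QUOTES) . '"';
}
$xml .= '/>';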
With so little information about your goal, all we can do is micro-optimize. Perhaps you should work on the principles behind your script instead? For instance, do you really have to generate 2000 items? Can you cache the result, or cache anything at all? Can you paginate the result, etc.?
A quick word about using PHP's XML libraries: XMLWriter will generally be slightly slower than string manipulation, and everything else will be noticeably slower than strings.