Sometimes when a user is copying and pasting data into an input form we get characters like the following:
didn’t,“ for beginning quotes and †for end quote, etc ...
I use this routine to sanitize most input on web forms (I wrote it a while ago but am also looking for improvements):
function fnSanitizePost($data) //escapes,strips and trims all members of the post array
{
if(is_array($data))
{
$areturn = array();
foreach($data as $skey=>$svalue)
{
$areturn[$skey] = fnSanitizePost($svalue);
}
return $areturn;
}
else
{
if(!is_numeric($data))
{
//with magic quotes on, the input gets escaped twice, which means that we have to strip those slashes. leaving data in your database with slashes in them, is a bad idea
if(get_magic_quotes_gpc()) //gets current configuration setting of magic quotes
{
$data = stripslahes($data);
}
$data = pg_escape_string($data); //escapes a string for insertion into the database
$data = strip_tags($data); //strips HTML and PHP tags from a string
}
$data = trim($data); //trims whitespace from beginning and end of a string
return $data;
}
}
I really want to avoid characters like I mention above from ever getting stored in the database, do I need to add some regex replacements in my sanitizing routine?
Thanks,
- Nicholas
didn’t,“ for beginning quotes and †for end quote
That's not junk, those are legitimate “smart quote” characters that have been passed to you encoded as UTF-8, but read, incorrectly, as ISO-8859-1.
You can try to get rid of them or try to parse them into plain old Latin-1 using utf_decode, but if you do you'll have an application that won't let you type anything outside ASCII, which in this day and age is a pretty poor show.
Better if you can manage it is to have all your pages served as UTF-8, all your form submissions coming in as UTF-8, and all your database contents stored as UTF-8. Ideally, your application would work internally with all Unicode characters, but unfortunately PHP as a language doesn't have native Unicode strings, so it's usually a case of holding all your strings also as UTF-8, and taking the risk of occasionally truncating a UTF-8 sequence and getting a �, unless you want to grapple with mbstring.
$data = pg_escape_string($data); //escapes a string for insertion into the database
$data = strip_tags($data); //strips HTML and PHP tags from a string
You don't want to do that as a sanitisation measure coming into your application. Keep all your strings in plain text form for handling them, then pg_escape_string() only on the way out to a Postgres query, and htmlspecialchars() only on the way out to an HTML page.
Otherwise you'll get weird things like SQL escapes appearing on variables that have passed straight through the script to the output page, and no-one will be able to use a plain less-than character.
One thing you can usefully do as a sanitisation measure is to remove any control codes in strings (other than newlines, \n, which you might conceivably want).
$data= preg_replace('/[\x00-\x09\x0B-\x19\x7F]/', '', $data);
You want to check out PHP's utf_decode function: Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1. It seems you're getting UTF characters and the database is not able to handle those.
Another solution is to change the encoding of the database, if possible.
I finally came up with a routine for replacing these characters. It took parsing some of the problematic strings one character at a time and returning the octal value of each character. In doing so I learned that smart quote characters come back as sets of 3 octal values. Here is routine I used to parse the string:
$str = "string_with_smart_quote_chars";
$ilen = strlen($str);
$sords = NULL;
echo "$str\n\n";
for($i=0; $i<$ilen; $i++)
{
$sords .= ord(substr($str, $i, 1))." ";
}
echo "$sords\n\n";
Here are the str_replace() calls to "fix" the string:
$str = str_replace(chr(226).chr(128).chr(156), '"', $str); // start quote
$str = str_replace(chr(226).chr(128).chr(157), '"', $str); // end quote
$str = str_replace(chr(226).chr(128).chr(153), "'", $str); // for single quote
I am going to continue building up an array of these search/replacements which I am sure will continue to grow with the increasing use of these types of characters.
I know that there are some canned routines for replacing these but I had no luck with any of them on the Solaris 10 platform that my scripts are running on.
-- Nicholas
Zend Framework's Zend_Filter and Zend_Filter_Input has very good tools for this.
Related
Ok, so I have some JSON, that when decoded, I print out the result. Before the JSON is decoded, I use stripslashes() to remove extra slashes. The JSON contains website links, such as https://www.w3schools.com/php/default.asp and descriptions like Hello World, I have u00249999999 dollars
When I print out the JSON, I would like it to print out
Hello World, I have $9999999 dollars, but it prints out Hello World, I have u00249999999 dollars.
I assume that the u0024 is not getting parsed because it has no backslash, though the thing is that the website links' forward slashes aren't removed through strip slashes, which is good - I think that the backslashes for the Unicode symbols are removed with stripslashes();
How do I get the PHP to automatically detect and parse the Unicode dollar sign? I would also like to apply this rule to every single Unicode symbol.
Thanks In Advance!
According to the PHP documentation on stripslashes (), it
un-quotes a quoted string.
Which means, that it basically removes all backslashes, which are used for escaping characters (or Unicode sequences). When removing those, you basically have no chance to be completely sure that any sequence as "u0024" was meant to be a Unicode entity, your user could just have entered that.
Besides that, you will get some trouble when using stripslashes () on a JSON value that contains escaped quotes. Consider this example:
{
"key": "\"value\""
}
This will become invalid when using stripslashes () because it will then look like this:
{
"key": ""value""
}
Which is not parseable as it isn't a valid JSON object. When you don't use stripslashes (), all escape sequences will be converted by the JSON parser and before outputting the (decoded) JSON object to the client, PHP will automatically decode (or "convert") the Unicode sequences your data may contain.
Conclusion: I'd suggest not to use stripslashes () when dealing with JSON entities as it may break things (as seen in the previous example, but also in your problem).
Your assumption is correct: u0024 is not getting parsed because it has no backslash. You can use regex to add backslash back after the conversion.
It looks like you have UTF-8 encoded strings internally, PHP outputs them properly, but your browser fails to auto-detect the encoding (it decides for ISO 8859-1 or some other encoding).
The best way is to tell the browser that UTF-8 is being used by sending the corresponding HTTP header:
header("content-type: text/html; charset=UTF-8");
Then, you can leave the rest of your code as-is and don't have to html-encode entities or create other mess.
If you want, you can additionally declare the encoding in the generated HTML by using the <meta> tag:
<meta http-equiv=Content-Type content="text/html; charset=UTF-8"> for HTML <=4.01
<meta charset="UTF-8">
for HTML5
HTTP header has priority over the <meta> tag, but the latter may be useful if the HTML is saved to HD and then read locally.
The main question you have to understand, is why do you need to strip slashes?
And, if it is really necessary to strip slashes, how to manage the encoding? Probably it is a good idea to convert unicode symbols before to strip slashes, not after, using html_entity_decode .
Anyway, you can try fix the problem with this workaround:
$string = "Hello World, I have u00249999999 dollars";
$string = preg_replace( "/u([0-9A-F]{0,4})/", "&#x$1;", $string ); // recover "u" + 4 alnums
$string = html_entity_decode( $string, ENT_COMPAT, 'UTF-8' ); // convert to utf-8
I have a string such as this - Panamá. I need to convert this string to Panam\xE1 so it's readable in a JavaScript file I'm generating using PHP.
Is there a function to encode this in PHP? Any ideas would be appreciated.
My rule is,
If you try to encode or escape data using preg_replace or
using massive mapping arrays or str_replace, STOP you are probably doing it wrong.
All it takes is one missed or eroneous mapping (and you WILL miss some mappings) then you end up with code that doesn't work in all cases and code which corrupts your data in some cases. Whole libraries have been written already dedicated to doing the translations for you (e.g. iconv) and for escaping data, you should use the proper PHP function.
If you plan on outputting the data to a browser (the fact you want to encode for javascript suggests this) then I suggest using UTF8 encoding. If your data is in latin-1, use the utf8_encode function.
Whether your PHP string contains ASCII characters or not, to send any data from PHP to JS you should ALWAYS use the json_encode function.
PHP code
$your_encoding = 'latin1';
$panama = "Panamá";
//Get your data in utf8 if it isnt already
$panama = iconv($your_encoding, "utf-8", $panama);
$panama_encoded = json_encode($panama);
echo "var js_panama = " . $panama_encoded . ";";
JS Output
var js_panama = "Panam\u00e1";
Even though JSON supports unicode, it may not be compatible with your non UTF-8 javascript file. This is not a problem because the json_encode PHP function will escape unicode characters by default.
Assuming that your input is in the latin-1 encoding then ord and dechex will do what you want:
$result = preg_replace_callback(
'/[\x80-\xff]/',
function($match) {
return '\x'.dechex(ord($match[0]));
},
$input);
If your input is in any other encoding then you would need to know what encoding that is and adapt the solution accordingly. Note that in this case it would not be possible to use specifically the \x## notation in the JS output in all cases.
This should work for you:
$str = "Panamá";
$str = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
$utf = iconv('UTF-8', 'UCS-4', current($m));
return sprintf("\x%s", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $str);
echo $str;
Output (Source Code):
Panam\xE1
I have some data imported from a csv. The import script grabs all email addresses in the csv and after validating them, imports them into a db.
A client has supplied this csv, and some of the emails seem to have a space at the end of the cell. No problem, trim that sucker off... nope, wont work.
The space seems to not be a space, and isn't being removed so is failing a bunch of the emails validation.
Question: Any way I can actually detect what this erroneous character is, and how I can remove it?
Not sure if its some funky encoding, or something else going on, but I dont fancy going through and removing them all manually! If I UTF-8 encode the string first it shows this character as a:
Â
If that "space" is not affected by trim(), the first step is to identify it.
Use urlencode() on the string. Urlencode will percent-escape any non-printable and a lot of printable characters besides ASCII, so you will see the hexcode of the offending characters instantly. Depending on what you discover, you can act accordingly or update your question to get additional help.
I had a similar problem, also loading emails from CSVs and having issues with "undetectable" whitespaces.
Resolved it by replacing the most common urlencoded whitespace chars with ''. This might help if can't use mb_detect_encoding() and/or iconv()
$urlEncodedWhiteSpaceChars = '%81,%7F,%C5%8D,%8D,%8F,%C2%90,%C2,%90,%9D,%C2%A0,%A0,%C2%AD,%AD,%08,%09,%0A,%0D';
$temp = explode(',', $urlEncodedWhiteSpaceChars); // turn them into a temp array so we can loop accross
$email_address = urlencode($row['EMAIL_ADDRESS']);
foreach($temp as $v){
$email_address = str_replace($v, '', $email_address); // replace the current char with nuffink
}
$email_address = urldecode($email_address); // undo the url_encode
Note that this does NOT strip the 'normal' space character and that it removes these whitespace chars from anywhere in the string - not just start or end.
Replace all UTF-8 spaces with standard spaces and then do the trim!
$string = preg_replace('/\s/u', ' ', $string);
echo trim($string)
This is it.
In most of the cases a simple strip_tags($string) will work.
If the above doesn't work, then you should try to identify the characters resorting to urlencode() and then act accordingly.
I see couples of possible solutions
1) Get last char of string in PHP and check if it is a normal character (with regexp for example). If it is not a normal character, then remove it.
$length = strlen($string);
$string[($length-1)] = '';
2) Convert your character from UTF-8 to encoding of you CSV file and use str_replace. For example if you CSV is encoded in ISO-8859-2
echo iconv('UTF-8', 'ISO-8859-2', "Â");
i have a bunch of files which are supposed to be html documents for the most part, however sometimes the editor(s) copy&pasted text from other sources into it, so now i come across some weird chars every now and then - for example non-encoded copyright sign, or weird things that look like a dash or minus but are something else (ascii #146?), or a single char that looks like "...".
i had a look at get_html_translation_table(), however this will only replace the "usual" special chars like &, euro signs etc., but it seems like i need regex and specify only allowed chars and discard all the unknown chars. I tried this here, but this didnt work at all:
function fixNpChars($string)
{
//characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
$pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
$replacement = '';
return preg_replace($pattern, $replacement, $string);
}
Any idea whats wrong here?
EDIT:
The database where i store my imported files and the php side is all set to utf-8 (content type utf-8, db table charset utf8/utf8_general_ci, mysql_set_charset('utf8',$this->mHandle); executed after db connection is established. Most of the imported files are either utf8 or iso-8859-1.
Your regex syntax looks a little problematic. Maybe this?:
$pattern = '/[\x00-\x08][\x0B-\x0C][\x0E-\x1F][\x80-\x9F][x7F]/u';
Don't think of removing the invalid characters as the best option, this problem can be solved using htmlentities and html_entity_decode functions.
I'm currently scraping a website for various pieces of textual data (with permission, of course). The issue I'm seeing is that certain characters aren't correctly encoded in the process. This is particularly prominent with apostrophes ('): leading to characters such as: .
Currently, I use the following code to convert various HTML entities from the scraped data:
htmlentities($content, ENT_COMPAT, 'UTF-8', FALSE)
Is there a better way to handle this sort of thing?
HTML entities have two goals:
Escape characters that have a special meaning in HTML, such as angle quotes, so they can be used as literals.
Display characters that are not supported by the character set you are using, such as the euro symbol in an ISO-8859-1 document.
They are not exactly an encoding tool.
If you want to convert from one charset into another one, I suggest you use iconv(). However, you must know both the source and the target charset. The source charset should be mentioned in the Content-Type response header and the target charset is something you decided when you started the site (although in your case it looks like UTF-8 is the most reasonable option).
You don't want to use htmlentities right away, I would use that on the data at the last point before you store it. One of the problems you'll run into is people don't always encode their entities properly anyway. Not everyone uses ™ they just copy the trademark in. If you put some logic in to try and grab whatever they put in and encode it properly you may be better off. For Example:
$patterns = array();
$patterns[0] = '/—/';
$patterns[1] = '/&nsbsp;/';
$patterns[2] = '/®/';
$replacements = array();
$replacements[2] = '&151;';
$replacements[1] = '&160;';
$replacements[0] = '&174;';
$ourhtml = preg_replace($patterns, $replacements, $html);
You could find all the "gotcha" characters like dashes and single quotes, apostrophes etc and encode them by hand, as well as use a set standard to the entities (text or numeric).
You could also use regular expressions to do the same thing, and would probably be a more elegant solution. But my suggestion would be to take some time filtering out what you don't want by hand, and then you know your data will be prepared exactly how you like.
It's a little bit difficult to suggest things based on the information provided. Can you provide an example snippet of text maybe?
Failing that, I'll employee the shotgun approach (e.g., suggesting a bunch of things and hoping one of them hits)
First of all, are you sure the page you're accessing is encoded in UTF-8? What does mb_detect_encoding say?
One option (may not work depending on your needs) would be to use iconv with the TRANSLIT option to convert the characters into something easier to handle using PHP. You could also look at using the mb_* functions for working with multibyte strings.
Are you sure htmlentities is the problem? If the content is UTF-8, and your site is set to serve ISO-8859-1, you're going to see odd characters. Check the encoding your browser is using to make sure it matches the encoding of the characters you're producing.
I don't see any issue with using htmlentities() as long as you pass false as the last parameter. This will ensure that you don't encode anything twice (such as turning & into &).