PHP trim and space not working - php

I have some data imported from a csv. The import script grabs all email addresses in the csv and after validating them, imports them into a db.
A client has supplied this csv, and some of the emails seem to have a space at the end of the cell. No problem, trim that sucker off... nope, won't work.
The "space" apparently isn't a real space: trim() leaves it in place, so a bunch of the emails fail validation.
Question: Is there any way I can detect what this erroneous character actually is, and how can I remove it?
Not sure if it's some funky encoding or something else going on, but I don't fancy going through and removing them all manually! If I UTF-8 encode the string first it shows this character as a:
Â

If that "space" is not affected by trim(), the first step is to identify it.
Use urlencode() on the string. Urlencode will percent-escape any non-printable and a lot of printable characters besides ASCII, so you will see the hexcode of the offending characters instantly. Depending on what you discover, you can act accordingly or update your question to get additional help.
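A minimal sketch of that identification step. The helper name describeBytes() and the sample address are made up for illustration; bin2hex() is shown next to urlencode() as a second way to look at the raw bytes.

```php
<?php
// Hypothetical helper: shows the raw bytes of a value so hidden characters stand out.
function describeBytes(string $value): string
{
    return sprintf('urlencode: %s | hex: %s', urlencode($value), bin2hex($value));
}

// Assumed sample: an address followed by a UTF-8 no-break space (bytes C2 A0).
echo describeBytes("user@example.com\xC2\xA0"), "\n";
```

A trailing %C2%A0 in the output would point to a UTF-8 no-break space, while a lone %A0 would suggest the same character in a single-byte encoding such as ISO-8859-1.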

I had a similar problem, also loading emails from CSVs and having issues with "undetectable" whitespaces.
Resolved it by replacing the most common urlencoded whitespace chars with ''. This might help if you can't use mb_detect_encoding() and/or iconv().
$urlEncodedWhiteSpaceChars = '%81,%7F,%C5%8D,%8D,%8F,%C2%90,%C2,%90,%9D,%C2%A0,%A0,%C2%AD,%AD,%08,%09,%0A,%0D';
$temp = explode(',', $urlEncodedWhiteSpaceChars); // turn them into a temp array so we can loop across
$email_address = urlencode($row['EMAIL_ADDRESS']);
foreach($temp as $v){
$email_address = str_replace($v, '', $email_address); // replace the current char with nuffink
}
$email_address = urldecode($email_address); // undo the url_encode
Note that this does NOT strip the 'normal' space character and that it removes these whitespace chars from anywhere in the string - not just start or end.

Replace all UTF-8 spaces with standard spaces and then do the trim!
$string = preg_replace('/\s/u', ' ', $string);
echo trim($string);
This is it.

In most of the cases a simple strip_tags($string) will work.
If the above doesn't work, then you should try to identify the characters resorting to urlencode() and then act accordingly.

I see a couple of possible solutions:
1) Get the last char of the string in PHP and check whether it is a normal character (with a regexp, for example). If it is not a normal character, remove it.
// PHP 7.1+ refuses to assign '' to a string offset, so cut the last byte off instead
$string = substr($string, 0, -1);
2) Convert the character from UTF-8 to the encoding of your CSV file and use str_replace. For example, if your CSV is encoded in ISO-8859-2:
echo iconv('UTF-8', 'ISO-8859-2', "Â");

Related

Problem when reading file with non-English characters in PHP

Currently, I'm facing an issue of reading a file that contains non-English characters. I need to read that file line by line using the following code:
while(!feof($handle)) {
$line = fgets($handle);
}
The case is this file has 1711 lines, but the strange thing is it shows 1766 lines when I tried traversing that file.
$text = file_get_contents($filePath);
$numOfLines = count(explode(PHP_EOL, $text));
I would really appreciate it if anyone could help me out with this issue.
You've tagged 'character-encoding', so at least you know what the start of the problem is. You've got some multi-byte characters in there. You are counting your 'lines' by exploding on the PHP_EOL character, which I'm guessing is 0x0A. If the file is in an encoding such as UTF-16, some multi-byte characters contain 0x0A as one of their bytes, so explode (acting on bytes and not multi-byte characters) treats that as the end of a 'line'. Note that this cannot happen in UTF-8 itself, where every byte of a multi-byte sequence is 0x80 or above, so check which encoding the file really uses. var_dump your exploded array and you'll see the issue easily enough.
Try count(mb_split('(\r?\n)', $text)) and see what you get. My regex is poor though and that might not work. I would see this question for more help on the regex you need to split on a new line:
Match linebreaks - \n or \r\n?
Remember that your line ending might possibly be \u0085, but I doubt it as PHP_EOL is being too aggressive.
If mb_split works, remember that you'll need to be using PHP's mb_ functions for all of your string manipulations. PHP's standard string functions assume single-byte characters and provide the separate mb_ functions to handle multi-byte wide characters.
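As a rough sketch of counting lines without depending on PHP_EOL, assuming the text is UTF-8 or another ASCII-compatible encoding: the helper name countLines() is made up, and the pattern only splits on the ASCII line breaks \r\n, \r and \n, so it never cuts through a UTF-8 multi-byte sequence.

```php
<?php
// Hypothetical helper: counts lines regardless of which ASCII line-break style is used.
function countLines(string $text): int
{
    if ($text === '') {
        return 0;
    }
    // Drop one trailing line break so "a\nb\n" counts as 2 lines, not 3.
    $text = preg_replace('/(\r\n|\r|\n)\z/', '', $text);
    // Split on Windows, old Mac and Unix line breaks; \r\n is tried first so it counts once.
    return count(preg_split('/\r\n|\r|\n/', $text));
}

// Assumed sample with mixed line endings and multi-byte characters.
echo countLines("caf\xC3\xA9\r\nna\xC3\xAFve\nlast\r"), "\n";
```

Comparing this count with the explode(PHP_EOL, ...) count on the same file is one way to see whether mixed line endings, rather than the encoding, explain the difference.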

remove invalid chars from html document

I have a bunch of files which are supposed to be HTML documents for the most part. However, the editor(s) sometimes copy&pasted text from other sources into them, so now I come across some weird chars every now and then: for example a non-encoded copyright sign, or weird things that look like a dash or minus but are something else (ascii #146?), or a single char that looks like "...".
I had a look at get_html_translation_table(), but this will only replace the "usual" special chars like &, euro signs etc. It seems like I need a regex that specifies only the allowed chars and discards all the unknown ones. I tried this, but it didn't work at all:
function fixNpChars($string)
{
//characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
$pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
$replacement = '';
return preg_replace($pattern, $replacement, $string);
}
Any idea what's wrong here?
EDIT:
The database where I store my imported files and the PHP side are all set to utf-8 (content type utf-8, db table charset utf8/utf8_general_ci, mysql_set_charset('utf8',$this->mHandle); executed after the db connection is established). Most of the imported files are either utf8 or iso-8859-1.
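Your regex syntax looks a little problematic. Five bracket groups in a row only match five consecutive characters, one from each range, and [x7F] is missing its backslash. Put all the ranges into a single character class instead:
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\x80-\x9F]/u';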
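Here is a hedged sketch of the whole function with every range in one character class. The name stripInvalidHtmlChars() is made up, and falling back to the untouched input when the subject is not valid UTF-8 is an assumption, not something the question asked for.

```php
<?php
// Hypothetical helper: removes control characters that may not appear in an HTML document.
function stripInvalidHtmlChars(string $s): string
{
    // One class: C0 controls except tab, LF and CR, then DEL and the C1 controls U+0080 to U+009F.
    $clean = preg_replace('/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{7F}-\x{9F}]/u', '', $s);
    // With the /u modifier preg_replace() returns null if $s is not valid UTF-8;
    // in that case this sketch hands the input back unchanged.
    return $clean ?? $s;
}

// Assumed sample: contains U+0001 and U+0085, both of which should disappear.
var_dump(stripInvalidHtmlChars("a\x01b\xC2\x85c\t"));
```

Because of the /u modifier, files that are really iso-8859-1 would have to be converted to UTF-8 first (for example with iconv()) before the pattern can match anything in them.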
Don't think of removing the invalid characters as the best option; this problem can be solved using the htmlentities and html_entity_decode functions.

Whitespace in a database field is not removed by trim()

I have some whitespace at the beginning of a paragraph in a text field in MySQL.
Using trim($var_text_field) in PHP or TRIM(text_field) in MySQL statements does absolutely nothing. What could this whitespace be and how do I remove it by code?
If I go into the database and backspace it out, it saves properly. It's just not being removed via the trim() functions.
function UberTrim($s) {
$s = preg_replace('/\xA0/u', ' ', $s); // strips UTF-8 NBSP: "\xC2\xA0"
$s = trim($s);
return $s;
}
The UTF-8 character encoding for a no-break space, Unicode (U+00A0), is the 2-byte sequence C2 A0. I tried to make use of the second parameter to trim() but that didn't do the trick. Example use:
assert("abc" === UberTrim(" \r\n \xc2\xa0 abc \t \xc2\xa0 "));
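For comparison, a sketch of a variant that strips the no-break space directly at both ends instead of converting it first. The name utf8Trim() is made up and the input is assumed to be valid UTF-8. Working on code points with the /u modifier avoids a pitfall of passing the bytes C2 and A0 to trim()'s second parameter, which operates on single bytes and could cut the final A0 byte off a character such as "à" (C3 A0).

```php
<?php
// Hypothetical helper: trims ordinary whitespace and U+00A0 from both ends of a UTF-8 string.
function utf8Trim(string $s): string
{
    // \A and \z anchor to the very start and end; the class names NBSP explicitly.
    $trimmed = preg_replace('/\A[\s\x{00A0}]+|[\s\x{00A0}]+\z/u', '', $s);
    // preg_replace() returns null on invalid UTF-8; this sketch then returns the input as-is.
    return $trimmed ?? $s;
}

var_dump(utf8Trim(" \r\n \xc2\xa0 abc \t \xc2\xa0 "));
```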
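A MySQL replacement for TRIM(text_field) that also removes UTF-8 no-break spaces, thanks to @RudolfRein's comment. MySQL does not interpret \x escapes itself, so build the query in a PHP double-quoted string, which expands "\xc2\xa0" to the two NBSP bytes:
TRIM(REPLACE(text_field, '\xc2\xa0', ' '))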
UTF-8 checklist:
Make sure your PHP source code editor is in UTF-8 mode without BOM (this is usually set in the editor's preferences).
Make sure your MySQL client connection is set for UTF-8 character encoding, e.g.
$pdo = new PDO('mysql:host=...;dbname=...;charset=utf8',$userid,$password);
$pdo->exec("SET CHARACTER SET utf8");
Make sure your HTTP server is set for UTF-8, e.g. for Apache:
AddDefaultCharset UTF-8
Make sure the browser expects UTF-8.
header('Content-Type: text/html; charset=utf-8');
or
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
If the problem is with UTF-8 NBSP, another simple option is:
REPLACE(the_field, UNHEX('C2A0'), ' ')
The best solution is a combination of a few things mentioned to you already.
First run ORD() on the string in question. In my case I had to run a reverse first because my problem character was at the end of the string.
ORD(REVERSE([col name]))
Once you discover the problematic char, run a
REPLACE([col_name], char([char_value_returned]), char(32))
Finally, call a proper
TRIM([col_name])
This will completely eradicate the problem character from all aspects of the string, and trim off the leading (in my case trailing) character.
Try using the MySQL ORD() function on the text_field to check the character code of the left-most character. It can be a non-whitespace character that looks like whitespace.
You have to detect these "whitespace" characters first. If it's some HTML entity, like &nbsp;, no trimming function would help, of course.
I'd suggest printing it out like this
echo urlencode($row['field']);
and see what it says.
Well, as it's A0 (160 decimal), the non-breaking space character, you can convert it to an ordinary space first:
<pre><?php
$str = urldecode("%A0")."bla";
var_dump(trim($str));
$str = str_replace(chr(160)," ",$str);
$str = trim($str);
var_dump($str);
and, ta-dam! -
string(4) " bla"
string(3) "bla"
Try to check what each "whitespace" character is by writing out its character code. It might be a non-visible character that isn't removed by trim.
By default, trim() only removes a few characters (space, tab, newline, carriage return, NUL and vertical tab), but other non-visible characters exist that might cause this problem.
Try
$var_text_field = str_replace(array("\r", "\n", "\t"), '', $var_text_field);

Is replacing a line break UTF-8 safe?

If I have a UTF-8 string and want to replace line breaks with the HTML <br> , is this safe?
$var = str_replace("\r\n", "<br>", $var);
I know str_replace isn't UTF-8 safe but maybe I can get away with this. I ask because there isn't an mb_strreplace function.
UTF-8 is designed so that multi-byte sequences never contain anything that looks like an ASCII character. That is, any time you encounter a byte with a value in the range 0-127, you can safely assume it to be an ASCII character.
And that means that as long as you only try to replace ASCII characters with ASCII characters, str_replace should be safe.
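A small sketch that checks this claim on a sample string. The sample text is made up; the bytes are written as escapes so the example does not depend on the encoding of the source file.

```php
<?php
// Assumed sample: UTF-8 text with multi-byte characters and mixed line breaks.
$var = "Gr\xC3\xBC\xC3\x9Fe\r\n\xE3\x81\x93\xE3\x82\x93\nend";

// Only ASCII bytes are searched for and inserted, so no multi-byte sequence is touched.
$html = str_replace(array("\r\n", "\n", "\r"), '<br>', $var);

// preg_match() with the /u modifier fails on invalid UTF-8, so 1 means the result is still valid.
$stillValid = preg_match('//u', $html) === 1;
var_dump($stillValid);
```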
str_replace() is safe for any ascii-safe character.
Btw, you could also look at nl2br().
1st: Use the code-sample markup for code in your questions.
2nd: Yes, it is safe.
3rd: It may not be what you want to achieve. This could be better:
$var = str_replace(array("\r\n", "\n", "\r"), "<br/>", $var);
Don't forget that different operating systems handle line breaks differently. The code above should replace all line breaks, no matter where they come from.

Routine for removing ALL junk from incoming strings?

Sometimes when a user is copying and pasting data into an input form we get characters like the following:
didn’t,“ for beginning quotes and †for end quote, etc ...
I use this routine to sanitize most input on web forms (I wrote it a while ago but am also looking for improvements):
function fnSanitizePost($data) //escapes,strips and trims all members of the post array
{
if(is_array($data))
{
$areturn = array();
foreach($data as $skey=>$svalue)
{
$areturn[$skey] = fnSanitizePost($svalue);
}
return $areturn;
}
else
{
if(!is_numeric($data))
{
//with magic quotes on, the input gets escaped twice, which means that we have to strip those slashes. leaving data in your database with slashes in them, is a bad idea
if(get_magic_quotes_gpc()) //gets current configuration setting of magic quotes
{
$data = stripslashes($data);
}
$data = pg_escape_string($data); //escapes a string for insertion into the database
$data = strip_tags($data); //strips HTML and PHP tags from a string
}
$data = trim($data); //trims whitespace from beginning and end of a string
return $data;
}
}
I really want to avoid characters like I mention above from ever getting stored in the database, do I need to add some regex replacements in my sanitizing routine?
Thanks,
- Nicholas
didn’t,“ for beginning quotes and †for end quote
That's not junk, those are legitimate “smart quote” characters that have been passed to you encoded as UTF-8, but read, incorrectly, as ISO-8859-1.
You can try to get rid of them or try to parse them into plain old Latin-1 using utf8_decode, but if you do you'll have an application that won't let you type anything outside ASCII, which in this day and age is a pretty poor show.
Better if you can manage it is to have all your pages served as UTF-8, all your form submissions coming in as UTF-8, and all your database contents stored as UTF-8. Ideally, your application would work internally with all Unicode characters, but unfortunately PHP as a language doesn't have native Unicode strings, so it's usually a case of holding all your strings also as UTF-8, and taking the risk of occasionally truncating a UTF-8 sequence and getting a �, unless you want to grapple with mbstring.
$data = pg_escape_string($data); //escapes a string for insertion into the database
$data = strip_tags($data); //strips HTML and PHP tags from a string
You don't want to do that as a sanitisation measure coming into your application. Keep all your strings in plain text form for handling them, then pg_escape_string() only on the way out to a Postgres query, and htmlspecialchars() only on the way out to an HTML page.
Otherwise you'll get weird things like SQL escapes appearing on variables that have passed straight through the script to the output page, and no-one will be able to use a plain less-than character.
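A minimal sketch of the escape-on-output idea. The helper name h() and the sample value are made up, and only the HTML side is shown because pg_escape_string() needs a live Postgres connection.

```php
<?php
// Hypothetical helper: escapes a plain-text value at the moment it is written into HTML.
function h(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Assumed input: kept as plain text inside the application, only trimmed.
$comment = trim('  1 < 2 & "quotes" stay readable  ');

// Escaping happens here, on the way out to the page, not on the way in.
echo '<p>', h($comment), "</p>\n";
```

Inside the application $comment still holds a real less-than sign, so it can be compared, searched or sent to another output (with that output's own escaping) without first undoing any HTML encoding.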
One thing you can usefully do as a sanitisation measure is to remove any control codes in strings (other than newlines, \n, which you might conceivably want).
$data = preg_replace('/[\x00-\x09\x0B-\x1F\x7F]/', '', $data);
You want to check out PHP's utf8_decode function: it converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1. It seems you're getting UTF-8 characters and the database is not able to handle those.
Another solution is to change the encoding of the database, if possible.
I finally came up with a routine for replacing these characters. It took parsing some of the problematic strings one byte at a time and printing the decimal value of each byte. In doing so I learned that smart quote characters come back as sets of 3 byte values. Here is the routine I used to parse the string:
$str = "string_with_smart_quote_chars";
$ilen = strlen($str);
$sords = NULL;
echo "$str\n\n";
for($i=0; $i<$ilen; $i++)
{
$sords .= ord(substr($str, $i, 1))." ";
}
echo "$sords\n\n";
Here are the str_replace() calls to "fix" the string:
$str = str_replace(chr(226).chr(128).chr(156), '"', $str); // start quote
$str = str_replace(chr(226).chr(128).chr(157), '"', $str); // end quote
$str = str_replace(chr(226).chr(128).chr(153), "'", $str); // for single quote
I am going to continue building up an array of these search/replacements which I am sure will continue to grow with the increasing use of these types of characters.
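One way such a growing list could be kept in a single place is sketched below. The function name asciiQuotes() is made up, and the entry for the opening single quote (U+2018) is an addition that is not among the three replacements above.

```php
<?php
// Hypothetical helper: maps UTF-8 smart quotes to their plain ASCII counterparts.
function asciiQuotes(string $s): string
{
    $map = array(
        "\xE2\x80\x9C" => '"', // U+201C, bytes 226 128 156: start quote
        "\xE2\x80\x9D" => '"', // U+201D, bytes 226 128 157: end quote
        "\xE2\x80\x98" => "'", // U+2018: opening single quote (added for illustration)
        "\xE2\x80\x99" => "'", // U+2019, bytes 226 128 153: single quote / apostrophe
    );
    // strtr() with an array applies all replacements in one pass over the string.
    return strtr($s, $map);
}

var_dump(asciiQuotes("\xE2\x80\x9Cdidn\xE2\x80\x99t\xE2\x80\x9D"));
```

New characters then only need one more line in $map rather than another str_replace() call.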
I know that there are some canned routines for replacing these but I had no luck with any of them on the Solaris 10 platform that my scripts are running on.
-- Nicholas
Zend Framework's Zend_Filter and Zend_Filter_Input have very good tools for this.
