remove invalid chars from html document - php

i have a bunch of files which are supposed to be html documents for the most part, however sometimes the editor(s) copy&pasted text from other sources into it, so now i come across some weird chars every now and then - for example non-encoded copyright sign, or weird things that look like a dash or minus but are something else (ascii #146?), or a single char that looks like "...".
i had a look at get_html_translation_table(), however this will only replace the "usual" special chars like &, euro signs etc., but it seems like i need regex and specify only allowed chars and discard all the unknown chars. I tried this here, but this didnt work at all:
function fixNpChars($string)
{
//characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
$pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
$replacement = '';
return preg_replace($pattern, $replacement, $string);
}
Any idea whats wrong here?
EDIT:
The database where i store my imported files and the php side is all set to utf-8 (content type utf-8, db table charset utf8/utf8_general_ci, mysql_set_charset('utf8',$this->mHandle); executed after db connection is established. Most of the imported files are either utf8 or iso-8859-1.

Your regex syntax looks a little problematic. Maybe this?:
$pattern = '/[\x00-\x08][\x0B-\x0C][\x0E-\x1F][\x80-\x9F][x7F]/u';

Don't think of removing the invalid characters as the best option, this problem can be solved using htmlentities and html_entity_decode functions.

Related

How to replace a symbol in a text string in PHP?

I want to do a search & replace in PHP with a symbol.
This is the symbol: ➤
I want to replace it with a dash, but that doesn't work. The problem looks like that the symbol cannot be found, even though it's there.
Other 'normal' search and replace operations work as expected. But replacing this symbol does not.
Any ideas how to address this symbol, so that the search and replace function actually can find it and replace it?
Your problem is (almost certainly) related to text/character encoding.
Special characters such as the ➤ you are referring to, are not part of the classical ISO-8859-1 character set; they are however part of Unicode family (codepoint U+27A4 to be exact). This means that, in order to use this (multibyte)character, you have to use a unicode character set, which generally means UTF-8.
All the basic characters (think A-Z, numbers, spaces, ...) overlap between UTF-8 and ISO-8859-1 (which is effectively the default character set), so when you don't use any special characters, you could use the wrong charset and things will pretty much continue to work just fine; that is until you try to use a character that is not part of the basic set.
Since your problem takes place entirely on the server side (inside PHP), and doesn't really touch upon the HTTP and HTML layers, we won't have to go into utf-8 content-type headers and the like, but you should be aware of them for future issues (if you weren't already).
The issue you have should be resolved once you meet 2 criteria:
Not all PHP functions are multibyte-aware; I'm not 100% sure, but i think str_replace is one of those which is not. The preg_replace function with its u flag enabled definitely is multibyte aware, and can serve the exact same function.
The text editor or IDE that you used to create the .php file may or may not be set to UTF-8 encoding, if it wasn't then you should switch that in order to be able to use such characters literally inside the source code.
Something like this should function correctly assuming the .php-file is stored in UTF-8 format:
$output = preg_replace('#➤#u', '-', $input);
Most likely you did not set the header of your PHP script to use the UTF-8 character set. Consider the following:
header('Content-type: text/plain; charset=utf-8');
$input = "This is the symbol: ➤";
$output = str_replace("➤", "-", $input);
echo $input . "\n" . $output;
This prints:
This is the symbol: ➤
This is the symbol: -
as that is simply replaceable using builtin php str_replace function, so that would be better if you can share us your code to check it more.
$str = "hey same let's change this to a dash: ➤";
echo "before: $str \n";
echo "after: ".str_replace("➤", "-", $str);
before: hey same let's change this to a dash: ➤
after: hey same let's change this to a dash: -
example

Remove certain special HTML characters from string in PHP

I am scraping information from a website and I was wondering how could I ignore or replace some special HTML characters such as "á", "á", "’" and "&amp". These characters cannot be scraped into a database. I have already replaced " " using this:
$nbsp = utf8_decode('á');
$mystring = str_replace($nbsp, '', $mystring);
But I cannot seem to do the same with these other characters. I am scraping from the website using XPath. This returns the exact content that I am looking for but keeps the HTML characters that I do not want as they don't seem to be allowed into a database.
Thanks for any help with this.
It sounds like you've got a collation issue. I suggest ensuring that your database collation is set to utf8_ci, and that your web page's content encoding is also set to UTF-8. This may well solve your problem.
The best way to strip all special characters is to run the string through htmlspecialchars(), then do a case-insensitive regex find and replace using the following pattern:
&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});
This should match named HTML entities (e.g. Ω or ) as well as decimal (e.g. &#01234) and hex-based (e.g. &x0BEE;) entities. The regex will strip them out completely.
Alternatively, just use the output of htmlspecialchars() to store it with the weird characters intact. Not ideal, but it works.

PHP trim and space not working

I have some data imported from a csv. The import script grabs all email addresses in the csv and after validating them, imports them into a db.
A client has supplied this csv, and some of the emails seem to have a space at the end of the cell. No problem, trim that sucker off... nope, wont work.
The space seems to not be a space, and isn't being removed so is failing a bunch of the emails validation.
Question: Any way I can actually detect what this erroneous character is, and how I can remove it?
Not sure if its some funky encoding, or something else going on, but I dont fancy going through and removing them all manually! If I UTF-8 encode the string first it shows this character as a:
Â
If that "space" is not affected by trim(), the first step is to identify it.
Use urlencode() on the string. Urlencode will percent-escape any non-printable and a lot of printable characters besides ASCII, so you will see the hexcode of the offending characters instantly. Depending on what you discover, you can act accordingly or update your question to get additional help.
I had a similar problem, also loading emails from CSVs and having issues with "undetectable" whitespaces.
Resolved it by replacing the most common urlencoded whitespace chars with ''. This might help if can't use mb_detect_encoding() and/or iconv()
$urlEncodedWhiteSpaceChars = '%81,%7F,%C5%8D,%8D,%8F,%C2%90,%C2,%90,%9D,%C2%A0,%A0,%C2%AD,%AD,%08,%09,%0A,%0D';
$temp = explode(',', $urlEncodedWhiteSpaceChars); // turn them into a temp array so we can loop accross
$email_address = urlencode($row['EMAIL_ADDRESS']);
foreach($temp as $v){
$email_address = str_replace($v, '', $email_address); // replace the current char with nuffink
}
$email_address = urldecode($email_address); // undo the url_encode
Note that this does NOT strip the 'normal' space character and that it removes these whitespace chars from anywhere in the string - not just start or end.
Replace all UTF-8 spaces with standard spaces and then do the trim!
$string = preg_replace('/\s/u', ' ', $string);
echo trim($string)
This is it.
In most of the cases a simple strip_tags($string) will work.
If the above doesn't work, then you should try to identify the characters resorting to urlencode() and then act accordingly.
I see couples of possible solutions
1) Get last char of string in PHP and check if it is a normal character (with regexp for example). If it is not a normal character, then remove it.
$length = strlen($string);
$string[($length-1)] = '';
2) Convert your character from UTF-8 to encoding of you CSV file and use str_replace. For example if you CSV is encoded in ISO-8859-2
echo iconv('UTF-8', 'ISO-8859-2', "Â");

Cakephp sanitizing and special characters

i'm using sanitize::paranoid on a string but i need to exclude a few special characters but it doesn't seem to work.
$content=sanitize::paranoid($content,array('à',' '));
I've changed the encoding of my file from ansi to utf8 but cakephp doesn't really like it so i need to find another way.
That array should contain the list of characters to exclude from sanitization, but it keep removing the "à" and i want those character in the final string.
Sanitize:paranoid is a simple preg_replace ($allow is just additional characters, escaped):
preg_replace("/[^{$allow}a-zA-Z0-9]/", '', $string);
As you can see, paranoid is quite paranoid... doesn't accept non-ascii letters by default.
The file where you had the à was probably saved in another encoding (working on windows?)
Anyway, if you want you can write a better filter by using /[^\p{L}]/u, which excludes letters in any lanaguage.
Taken from the Sanitize::paranoid function:
cleaned = preg_replace("/[^{$allow}a-zA-Z0-9]/", '', $string);
Because your character (à) is not in this range it will not be returned.
If you're using Cake 2.x you can override the Sanitize class in your app folder
and replace all occurrences of:
a-zA-Z0-9
with:
\w
This should return the accented character (it does for me). You can also look at the
multibyte functions if you like but that might be a problem if you're building a CMS.
it must be some special encoding problems that cakephp paranoid doesnt know
Sanitize::paranoid($badString, array(' ', '#')); # is the allowed char
it should be working. i tried this example myself

Weird utf8 conversion problem in php

So I'm working on a project that is taking data from a file, in the file some lines require utf8 symbols but are encoded oddly, they are \xC6 for example rather than being \Æ
If I do as follows:
$name = "\xC6ther";
$name = preg_replace('/x([a-fA-F0-9]{2})/', '&#$1;', $name);
echo utf8_encode($name);
It works fine. I get this:
Æther
But if I pull the same data from MySQL, and do as follows:
$name = $row['OracleName'];
$name = preg_replace('/x([a-fA-F0-9]{2})/', '\&#$1;', $name);
$name = utf8_encode($name);
Then I receive this as output:
\&#C6;ther
Anyone know why this is?
As requested, vardump of $row['OracleName'];
string(15) "xC6ther Barrier"
on your second preg_replace why there is a \
preg_replace('/x([a-fA-F0-9]{2})/', '&#$1;', $name);
ok I think there is some confusion here. you regular expression is matching something like x66 and would replace that by '&#66', which seems to be some html entities encoding to me but you are using utf8_encode which do that (from manual):
utf8_encode — Encodes an ISO-8859-1 string to UTF-8
so the things would never get converted ... (or to be more precise the '&#66' would remains '&#66' since they are all same characters in ISO-8859-1 and UTF-8)
also to be noted on your first snippet you use \xC6 but this would never get caught by the preg_replace since it's already encoded character. The \x means the next hex number (0x00 ~ 0xFF) would be drop in the string as is. it won't make a string xC6
So I am kind of confused of what you really wanna do. what the preg_replace is all about?
if you want to convert HTML entities to UTF-8 look into mb_convert_encoding (manual), if you want to do the reverse, code in HTML entities from some UTF-8 look into htmlentities (manual)
and if it has nothing to do with all of that and you want to simply change encoding mb_convert_encoding is still there.
Figured out the problem, on the SQL pull I missed an 'x' in the preg_replace
preg_replace('/x([a-fA-F0-9]{2})/', '&#x$1;', $name);
Once I added in the x, it worked like a charm.

Categories