XML Non Breaking White Space - php

I think the cause of my woes at present is the non-breaking white space.
It appears some nasty characters have found their way into our MySQL database from our back office systems. I'm trying to generate XML output using PHP's XMLWriter, but loads of these silly characters are getting into the field.
They're displayed in nano as ^K, in gedit as a weird square box, and when you delete them manually in MySQL they don't take up any physical space, even though you know you've deleted something.
Please help me get rid of them!
Here is the line that is the nightmare at present (I've skipped the rest of the XMLWriter buildup).
$writer->writeElement("description",$myitem->description);

After you have identified exactly which character you want to remove (and its byte sequence), you can just remove it, for example with str_replace:
$binSequence = "..."; // the binary representation of the character in question
$descriptionFiltered = str_replace($binSequence, '', $myitem->description);
$writer->writeElement("description", $descriptionFiltered);
You have not yet said which concrete character you mean, so I can't give the byte sequence. If it is a group of characters, the filtering might differ a bit.
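One way to find out which bytes are involved is to list the control bytes in the field together with their hex codes. This is only a sketch: findControlBytes is a hypothetical helper name, and the sample string stands in for $myitem->description.

```php
<?php
// Hypothetical helper: count each ASCII control byte found in $text
// (tab, LF and CR are ignored because they are usually legitimate).
function findControlBytes(string $text): array
{
    $found = [];
    if (preg_match_all('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', $text, $matches)) {
        foreach ($matches[0] as $byte) {
            // Prefix with "0x" so the array keys stay strings.
            $hex = '0x' . strtoupper(bin2hex($byte));
            $found[$hex] = ($found[$hex] ?? 0) + 1;
        }
    }
    return $found;
}

// Sample data standing in for $myitem->description.
$description = "First line\x0BSecond line\x0B";
print_r(findControlBytes($description));
```

Once the hex code is known, it can be written directly in a PHP double-quoted string (for example "\x0B") and passed to str_replace.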

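They seem to be vertical tabs, ASCII 0x0B (nano's ^K is Ctrl-K, byte 11). You should be able to REPLACE them in MySQL. Note that the argument order is REPLACE(str, from_str, to_str), and that MySQL string literals have no \v escape ('\v' is just the letter v), so build the character with CHAR(11):
SELECT REPLACE(`value`, CHAR(11), '') FROM your_table WHERE `key` = 'foo';
Here your_table is a placeholder for the real table name. Alternatively, you can remove the character afterwards in PHP with a simple str_replace (the "\v" escape is available since PHP 5.2.5):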
str_replace("\v", '', $result);
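If more than one kind of control character has crept in, a broader filter may be easier than replacing them one at a time. The sketch below removes every control character below 0x20 except tab, line feed and carriage return, which are the only ones XML 1.0 allows. The function name and sample data are made up for illustration, and the XMLWriter part only runs if that extension is loaded.

```php
<?php
// Remove the C0 control characters that XML 1.0 does not allow.
// Working byte-wise is safe for UTF-8: bytes below 0x80 never occur
// inside a multi-byte sequence.
function stripXmlIllegalControls(string $text): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text);
}

$description = "Blue widget\x0B, 10cm\x0B"; // sample data
$clean = stripXmlIllegalControls($description);

if (class_exists('XMLWriter')) {
    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->writeElement('description', $clean);
    echo $writer->outputMemory(), "\n";
}
```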

Related

PHP remove terminal codes from string

While processing the input/output of a process created with proc_open, I've been hit with special terminal ANSI codes (\033[0J, \033[13G). Aside from my not finding a reference for what these particular codes do, they are really messing with my preg_match calls.
Does PHP have a built-in method for cleansing these types of strings? Or what would be the correct expression to use with preg_replace? Please note, I am dealing with non-ASCII characters, so stripping everything except... will not work.
Usually ANSI codes are introduced by an ESC (\033, aka \x1b), an open square bracket, then numbers (possibly several, separated by semicolons, as in \033[32;40m), and terminated by a letter.
You can use something like #\\x1b[[][0-9]+(;[0-9]*)*[A-Za-z]# to preg_replace them all to oblivion; the trailing * on the group lets it match any number of ;-separated parameters.
This works (just tested), even if definitely overkill:
$test = preg_replace('#\\x1b[[][^A-Za-z]*[A-Za-z]#', '', $test);
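For reference, \033[0J erases from the cursor to the end of the screen and \033[13G moves the cursor to column 13. A slightly tighter variant of the pattern above can be wrapped in a small function; stripAnsiCsi is a hypothetical name, and the sketch only targets CSI sequences (ESC followed by [), not every possible terminal escape.

```php
<?php
// Strip ANSI CSI sequences: ESC, '[', optional parameters
// (digits, ';', '?'), then a single terminating letter.
// No /u flag is needed: the pattern only touches ASCII bytes,
// so multi-byte UTF-8 characters pass through unchanged.
function stripAnsiCsi(string $text): string
{
    return preg_replace('/\x1b\[[0-9;?]*[A-Za-z]/', '', $text);
}

$raw = "\033[0JBuild \033[1;32mpassed\033[0m\033[13G ünïcode";
echo stripAnsiCsi($raw), "\n";
```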
I've also found this on GitHub, and this on SO.

Having en-dash at the end of the string doesn't allow json_encode

I am trying to extract n characters from a string using
substr($originalText,0,250);
The nth character is an en-dash. So I get the last character as â€ when I view it in Notepad. In my editor, Brackets, I can't even open the log file, since it only supports UTF-8 encoding.
I also cannot run json_encode on this string.
However, when I use substr($originalText,0,251), it works just fine. I can open the log file and it shows an en-dash instead of â€. json_encode also works fine.
I can use mb_convert_encoding($mystring, "UTF-8", "Windows-1252") to circumvent the problem, but could anyone tell me why having these characters at the end specifically causes an error?
Moreover, on doing this, my log file shows â€ in Brackets, which is confusing too.
My question is: why is having the en-dash at the end of the string different from having it anywhere else (followed by other characters)?
Hopefully my question is clear, if not I can try to explain further.
Thanks.
Pid's answer gives an explanation for why this is happening, this answer just looks at what you can do about it...
Use mb_substr()
The multibyte string module was designed for exactly this situation, and provides a number of string functions that handle multibyte characters correctly. I suggest having a look through there as there are likely other ones that you will need in other places of your application.
You may need to install or enable this module if you get a function not found error. Instructions for this are platform dependent and out-of-scope for this question.
The function you want for the case in your question is mb_substr(). You call it the same way as substr(), and it takes an additional optional encoding argument.
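A minimal sketch of the difference, assuming the mbstring extension is available and the text is UTF-8. The sample string is made up so that the en-dash is the 250th character.

```php
<?php
// 249 ASCII characters, then an en-dash (3 bytes in UTF-8), then more text.
$originalText = str_repeat('a', 249) . "\u{2013}" . ' and more';

// substr() counts bytes: the cut lands after the first byte of the dash.
$byBytes = substr($originalText, 0, 250);

// mb_substr() counts characters: the whole dash is kept.
$byChars = mb_substr($originalText, 0, 250, 'UTF-8');

var_dump(json_encode($byBytes));           // bool(false)
echo json_last_error_msg(), "\n";          // reports malformed UTF-8
var_dump(json_encode($byChars) !== false); // bool(true)
```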
UTF-8 uses multi-byte sequences, which extend the character set beyond ASCII to accommodate many more characters.
A single UTF-8 character may be coded into one, two, three or four bytes, depending on the character.
You cut the string right in the middle of a multi-byte character:
[<-character->]
[byte-0|byte-1]
       ^
You cut the string right here in the middle!
[<-----character---->]
[byte-0|byte-1|byte-2]
       ^      ^
Or anywhere here if it's 3 bytes long.
So the decoder has the first byte(s) but can't read the entire character because the string ends prematurely.
This causes all the effects you are witnessing.
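The effect can be reproduced with the en-dash alone. This is a small illustration, assuming mbstring is installed; it shows the three bytes of the character and what the first two look like when misread as Windows-1252.

```php
<?php
$dash = "\u{2013}";             // EN DASH
echo bin2hex($dash), "\n";      // e28093
echo strlen($dash), "\n";       // 3

// Keep only the first two bytes, as a byte-based cut might.
$partial = substr($dash, 0, 2);

// The fragment is no longer valid UTF-8.
var_dump(mb_check_encoding($partial, 'UTF-8')); // bool(false)

// Interpreted as Windows-1252, the same two bytes are two separate characters.
echo mb_convert_encoding($partial, 'UTF-8', 'Windows-1252'), "\n"; // â€
```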
The solution to this problem is here in Dezza's answer.

PHP - preg_match() - matching substitution character black diamond with question mark

I have a problem with the substitution character, the diamond question mark �, in text I'm reading with SplFileObject. This character is already present in my text file, so nothing can be done to convert it to some other encoding. I decided to search for it with preg_match(), but the problem is that PHP can't find any occurrence of it. PHP probably sees it as a different character than �. I don't want to just remove this character from the text, which is why I want to search for it with preg_match(). Is there any way to match this character in PHP?
I tried the regex /.�./i, but without success.
Try this code. The hexadecimal code point of the � character is FFFD:
$line = "�";
if (preg_match("/\x{FFFD}/u", $line, $match))
print "Match found!";
PHP with SplFileObject seems to read the file a little differently: instead of U+FFFD it sees U+0093 and U+0094. If you are having the same problem I had, I suggest using hexdump to find out how the unrecognized character is encoded in the file. Afterwards, use the snippet recommended by @stribizhev in the comments to get the hex code as PHP sees it. Once you have figured out the correct hex code of the unrecognized character (use a conversion tool, as also suggested by @stribizhev, to get the correct value), you can use the preg_...() functions. Here's the solution to my problem:
preg_replace("/(?|\x93|\x94)/i", "'", $text);
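To check from within PHP what a line actually contains, it can help to count U+FFFD matches and to look at the raw bytes. The sketch below uses a hypothetical helper, countReplacementChars. It relies on the fact that preg functions with the /u flag fail (return false) when the subject is not valid UTF-8, which is one reason a pattern can silently find nothing.

```php
<?php
// Count U+FFFD characters in $text.
// Returns -1 when $text is not valid UTF-8, because preg_match_all()
// with the /u flag returns false in that case.
function countReplacementChars(string $text): int
{
    $n = preg_match_all('/\x{FFFD}/u', $text);
    return $n === false ? -1 : $n;
}

$utf8Line   = "He said \u{FFFD}hello\u{FFFD}"; // real U+FFFD characters
$legacyLine = "He said \x93hello\x94";          // raw bytes 0x93/0x94, not UTF-8

var_dump(countReplacementChars($utf8Line));   // int(2)
var_dump(countReplacementChars($legacyLine)); // int(-1)

// U+FFFD is stored as the three bytes EF BF BD in UTF-8.
echo bin2hex("\u{FFFD}"), "\n";               // efbfbd
```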

remove invalid chars from html document

I have a bunch of files which are supposed to be HTML documents for the most part. However, sometimes the editor(s) copied and pasted text from other sources into them, so now I come across some weird chars every now and then: for example a non-encoded copyright sign, or weird things that look like a dash or minus but are something else (ascii #146?), or a single char that looks like "...".
I had a look at get_html_translation_table(), but it only replaces the "usual" special chars like &, euro signs etc. It seems like I need a regex that specifies only the allowed chars and discards all the unknown ones. I tried this, but it didn't work at all:
function fixNpChars($string)
{
//characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
$pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
$replacement = '';
return preg_replace($pattern, $replacement, $string);
}
Any idea what's wrong here?
EDIT:
The database where I store my imported files and the PHP side are all set to UTF-8 (content type utf-8, db table charset utf8/utf8_general_ci, and mysql_set_charset('utf8',$this->mHandle); executed after the db connection is established). Most of the imported files are either UTF-8 or ISO-8859-1.
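Your regex syntax looks a little problematic. Five character classes in a row only match five such characters appearing one after another, and [x{007F}] is missing its backslash. Put all the ranges into a single class. Maybe this?:
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u';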
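A single character class covering all the ranges listed in the question's comment could be sketched as follows. The function name is hypothetical, and the input is assumed to be UTF-8 already. With the /u flag preg_replace returns null for invalid UTF-8, so ISO-8859-1 files would need converting first, for example with mb_convert_encoding.

```php
<?php
// Remove code points 00-08, 0B-0C, 0E-1F and 7F-9F from a UTF-8 string.
// Tab (09), LF (0A) and CR (0D) are kept.
function stripDisallowedControls(string $utf8): string
{
    $clean = preg_replace('/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{7F}-\x{9F}]/u', '', $utf8);
    if ($clean === null) {
        // With /u, preg_replace() returns null when the subject is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

// U+0092 is a C1 control; U+00A9 (copyright sign) is a normal character.
$sample = "a\x07b\u{0092}c \u{00A9} 2010\n";
echo stripDisallowedControls($sample);
```

Note that a stray byte such as 0x92 in an ISO-8859-1 or Windows-1252 file is a different thing from the code point U+0092 in a UTF-8 file, so the conversion step matters.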
Removing the invalid characters is not necessarily the best option; this problem can also be solved using the htmlentities and html_entity_decode functions.

PHP Unicode character questions

Here's a link I found, which even has a character I need to play with for other projects of mine.
http://www.fileformat.info/info/unicode/char/2446/index.htm
There is a box with the Title of: "Encodings" on that page. And I am wondering about some of the rows.
I obviously need a course on this sort of thing, but I'm wondering what the difference is between "HTML Entity (decimal)" and "HTML Entity (hex)".
The funny thing is, which confuses me, I throw those characters on a web page, and they display fine. But I haven't specified any UTF-8 encoding in the php page.
<?php
$string1 = '&#x2446;';
$string2 = '&#9286;';
echo $string1;
echo '<br>';
echo $string2;
?>
Does the browser know how to display both automatically?
And to make it weirder, I can only see those characters on my Mac, in Firefox.
But my Windows box doesn't want to show them. I've tested it in Chrome and Firefox. Do I need to tell the browsers to view them correctly? Or is it an operating system modification?
They're both valid numeric HTML entities, and the browser does indeed know how to decode them. The difference is the first is a hexadecimal number, while the latter is decimal.
0x2446 = 9286
Note that 0x means hexadecimal.
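The equivalence is easy to check in PHP. This is a small sketch using hexdec and html_entity_decode, assuming UTF-8 as the target encoding.

```php
<?php
// 0x2446 and 9286 are the same number written in two bases.
$codePoint = hexdec('2446');
var_dump($codePoint); // int(9286)

// Both references decode to the same character.
$fromHex = html_entity_decode('&#x2446;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
$fromDec = html_entity_decode('&#9286;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($fromHex === $fromDec); // bool(true)

// In UTF-8 the character U+2446 is stored as three bytes.
echo bin2hex($fromHex), "\n";    // e29186
```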
Also note that it is good practice to always have your server explicitly specify an encoding. The W3C explains how to do so. UTF-8 is a good choice.
If you use any Unicode encoding, you can always put the character right on your page, so you don't have to use entities.
To be exact, neither is an entity reference. &amp; is an entity reference that refers to the entity named amp that is defined as:
<!ENTITY amp CDATA "&#38;" -- ampersand, U+0026 ISOnum -->
Here you can see that the entity's value is just another reference: &#38;.
&#x2446; and &#9286; are "just" character references (numeric character references, to be exact) and refer to characters by specifying the code position of a character in the Universal Character Set, i.e. the Unicode character set.
You can use any "HTML Entity" in any encoding, and in practice, if you have the appropriate fonts installed, every browser will work fine. After all, it was created for displaying characters that are not included in the current encoding. In your situation, it looks like you have to install some fonts on your Windows box.
On the other hand, it has almost nothing to do with PHP.
