php is incorrectly converting &not in strings to ¬

I need to make up a simple string in PHP which is a string of data to be posted to another site.
The problem is that one of the fields is 'notify_url=..', and when I use it, PHP seems to take the & in front of it plus the "not" part to mean the logical operator AND NOT, and converts it to a ¬ character:
$string = 'field1=1234&field2=this&notify_url=http';
prints as 'field1=1234&field2=this¬ify_url=http'
The encoding on my page is UTF-8.
I have tried creating the string with single quotes as well as double quotes. I have tried making the field names variables and concatenating them in, but it always produces the special character.
This is not being urlencoded because the string is meant to be hashed before the form is submitted to verify posted data.

PHP isn't doing that, it's your browser interpreting HTML entity notation. & has a special meaning in HTML as the start of an HTML entity, and &not happens to be a valid HTML entity. You need to HTML-encode characters with special meanings:
echo htmlspecialchars($string);
// field1=1234&amp;field2=this&amp;notify_url=http
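
A minimal sketch of how this fits the question's use case, assuming the raw string is what gets hashed and the escaped copy is used only for HTML output. The md5() call and the variable names are placeholders, not something the question specifies:

```php
<?php
// Raw data: this is the form that gets hashed or sent to the other site.
$string = 'field1=1234&field2=this&notify_url=http';

// Hash the raw string; md5() is only a stand-in for whatever hash is required.
$hash = md5($string);

// Escape only at the point where the string is written into an HTML page.
$forHtml = htmlspecialchars($string, ENT_QUOTES, 'UTF-8');

echo $forHtml, "\n"; // field1=1234&amp;field2=this&amp;notify_url=http
```

The browser turns &amp; back into & when it renders the page, so the visitor sees the original string while $string itself is never modified.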

Related

Output PHP string to show escaped characters

In PHP, is it at all possible to output the contents of a string to show any escaped characters that may be contained within the string? I get that the whole point of escaping characters is so that they aren't treated in the usual way. But I would still like to be able to view the raw contents of a string so I can see for myself exactly how characters like \n and \r, etc. are represented. Does PHP have a method for doing this?

Use json_encode() to encode the string as JSON. The JSON notation for strings (which is, in fact, JavaScript's) uses essentially the same escape sequences as PHP's double-quoted strings. Both JavaScript and PHP were inspired by C and copied its notation for string literals.
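
A short sketch of that approach; the sample string is made up for illustration:

```php
<?php
// A double-quoted literal, so \n, \r and \t are real control characters here.
$raw = "line one\nline two\r\tend";

// json_encode() renders the control characters as visible escape sequences
// and wraps the result in double quotes.
$visible = json_encode($raw);

echo $visible, "\n"; // "line one\nline two\r\tend"
```

Note that, by default, json_encode() also escapes forward slashes and non-ASCII characters, so the output is a debugging view rather than an exact PHP literal.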

If you use single quotation marks, it should do what you need.
E.g. echo 'this\n'; will output this\n, whereas echo "this\n"; will output this followed by a newline.
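
A small sketch of the difference, comparing string lengths. Keep in mind that this only applies to literals written in source code; it does not change how an existing string that already contains a newline is displayed:

```php
<?php
$single = 'this\n'; // the backslash and "n" stay as two literal characters
$double = "this\n"; // \n becomes one newline character

echo strlen($single), "\n"; // 6
echo strlen($double), "\n"; // 5
```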

PHP http_build_query returns symbols

I've been trying to post some data with CURL. The http_build_query may be the problem in my case.
I have a long list of information to post, like data mixed with data in arrays.
It looks like, when I call http_build_query, it returns a URL full of strange symbols, like this:
contact[person]=testes83¶ms[make]=ford¶ms[model]=focus¶ms[version]=mk1-1998-2004
Which, in my opinion, causes errors when the server tries to do something with it.
Also, I have a word that starts with "re"; after http_build_query, it's transformed to ®
'region_id' => 1,
Transforms into
'®gion_id' => 1,
Also here's the http_build_query that I'm using
http_build_query($car_info,'', '&');

I bet you put the string returned by http_build_query() directly into HTML, without properly encoding the HTML entities.
As per the HTML standard, there are four characters (<, >, & and ") that should always be properly encoded using their entities representation when they are used in HTML as normal characters:
< must be encoded as &lt;
> as &gt;
& as &amp;
" as &quot;
An HTML character entity always starts with & and should end with ;. The ending ; is optional, and browsers can successfully recognize character entities that don't end with ; when they are followed by certain characters (white space, quotes, dot, comma, < etc).
But when the ending ; is missing, the browser is allowed to try to recognize an incomplete character entity or ignore it and consider & represents itself.
The string produced by http_build_query() is:
contact[person]=testes83&params[make]=ford&region_id=1
When it is placed in HTML as is, the browser:
interprets &para from &params as ¶ (it should be written &para;);
interprets &reg from &region as ® (it should be written &reg;).
The browser is right!
Your HTML is invalid and when this happens the browser is allowed to correct it as it pleases!
As @Álvaro González points out in a comment (thank you!), currently all major browsers recognize character entities when they don't end with ; and are followed by other letter characters (as happens in URLs).
You must always use htmlentities() or at least htmlspecialchars() to properly encode any string you build dynamically before throwing it as text in HTML. This includes the URLs, even when they are used as values for href or src HTML attributes.
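
A minimal sketch of that advice, assuming the query string is both sent with cURL and shown in a link. The array contents and the example.com URL are hypothetical:

```php
<?php
// Hypothetical data, loosely modelled on the question.
$car_info = [
    'contact'   => ['person' => 'testes83'],
    'params'    => ['make' => 'ford'],
    'region_id' => 1,
];

// Raw query string: this is the form to pass to cURL, unchanged.
$query = http_build_query($car_info, '', '&');

// Escaped copy: only for embedding in an HTML page.
$href = htmlspecialchars('http://example.com/search?' . $query, ENT_QUOTES, 'UTF-8');

echo '<a href="' . $href . '">Search</a>', "\n";
```

The escaped copy contains &amp;params and &amp;region_id, so the browser no longer sees &para or &reg. The raw $query is still correct for the HTTP request itself; only its display in HTML needed escaping.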

Using Regular Expressions with user input in PHP

I was wondering if anybody knew how to get around this problem.
I am gathering user input from a HTML form which is then posted using htmlspecialchars into PHP to avoid issues when using quotes/etc...
However, I also want to run server-side validation checks on the data being gathered through regular expressions - though I'm not sure how to go about this.
So far, I have thought of decoding the htmlspecialchars output, but because I am going to be using the strings straight away, this means that the code could break after I run this conversion. E.g., let's say the user entered a single double-quote character, ", into a field. This would be converted to &quot;; then if I decode this and use it in a variable, it could end up like: $string = """; which is going to give me issues.
Any advice on this would be greatly appreciated!

You seem to misunderstand the difference between data and how this data is altered to be parseable in a certain context.
A php string can contain any data. What is stored in this string is the "raw" form: the form in which we want to manipulate the data if needed.
In certain contexts, not all characters are valid. For example, in an HTML textarea, the < and > characters may not be used literally, because they are special characters. We still want to be able to use these characters. To use special characters in a context, we escape them. By escaping a special character, it loses its special meaning. In the context of an HTML textarea, the < character is escaped as the sequence &lt;. Unlike the < character, this escaped sequence does not have a special meaning in HTML, and thus if we send the following sequence to the browser, it knows how to parse that sequence and display the right thing: <textarea>&lt;</textarea>. When we talk about what data this textarea contains, we do not say that it contains &lt;, but instead we say that it contains <.
As you said, in a php script, in a double quoted string, the " character has a special meaning. This has only to do with parsing. PHP simply does not know how to parse a sequence $str = """;. If we would want to have the double quote in such a double quoted string, we would need to escape it. We escape a double quote in a php double quoted string by prepending it with a \. To make a string containing a single double quote, using the double quoted notation, you would write $str = "\"";.
However, none of this matters. You are taking input from an HTML form. When you click the submit button, the browser reads what is in the textarea (and decodes it as HTML?). The browser then encodes it in the way dictated by the form tag, and sends it to the server. The server then decodes the blob of text back into its raw data form. That data is passed to PHP, and it is this form you will encounter in $_POST['myTextarea'].
In conclusion: If data is encoded, realize for which context it was encoded and decode it based on that context. You do not need to escape for php quoted strings, because you are working on internal strings. There is nothing to parse. Remind yourself that when you are going to use the data somewhere, that you should take care that all special characters in your data for that particular context are escaped.

I suppose that the htmlspecialchars() function is called after the form is posted to PHP. The simplest solution, then, is to match against the regular expression first and call htmlspecialchars() afterwards.
Also, if you have a string encoded with htmlspecialchars(), then after decoding it with htmlspecialchars_decode(), PHP's internal representation will be the same as the literal "\"", so you break nothing. There is a big difference between how you write strings by hand in a PHP file and how PHP handles them internally. You really don't need to worry about this.
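
A minimal sketch of that order of operations. The input value and the validation pattern are made up; in a real script, $input would come from $_POST:

```php
<?php
// Stand-in for $_POST['myTextarea']; the value is hypothetical.
$input = 'He said "hi"';

// Validate the raw value first: letters, digits, spaces and double quotes only.
$isValid = preg_match('/^[A-Za-z0-9 "]+$/', $input) === 1;

// Escape only when writing the value into HTML.
$safe = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

var_dump($isValid); // bool(true)
echo $safe, "\n";   // He said &quot;hi&quot;

// Decoding restores the original data; no PHP syntax is involved at runtime.
var_dump(htmlspecialchars_decode($safe, ENT_QUOTES) === $input); // bool(true)
```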

Converting Symbols (Copyright, Reg etc)

I am trying to sanitise database input and found a problem with the Ⓡ character.
Ⓡ converts to
&#9415;
Even with html_entity_decode around the variable.
This is a problem because the field is only meant to allow 4 characters in the database.
® Actually works though and is treated as a single character.
I have the same problem with Ⓒ vs ©.
As far as I know they are just html entities so should be decoded. However they aren't even encoded with htmlspecialchars(). It just echoes out the code
&#9415;
Does PHP have any built-in functions to solve this? Thanks
Edit just to say what I am trying to do:
I have text fields to input and add to a database which displays in a table below.
When I enter any other character like < > &, it enters straight into the database as one character.
I am trying to make Ⓡ and Ⓒ always go in as one character as well (instead of 6).
I am only encoding on output in the table so certain characters don't break the website.

The entity most likely fails to decode because the target character set passed to html_entity_decode is still the default ISO-8859-1 (the default before PHP 5.4; later versions default to UTF-8). ISO-8859-1 cannot encode "Ⓡ" (CIRCLED LATIN CAPITAL LETTER R), but it can encode "®" (REGISTERED SIGN).
So, first, to decode it correctly:
html_entity_decode('&#9415;', ENT_COMPAT, 'UTF-8')
But secondly, "Ⓡ" and "®" are not the same character, and you probably don't want "Ⓡ".
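
A short sketch of the effect on length, assuming the stored value is the decimal reference &#9415; (U+24C7) and that the mbstring extension is available for mb_strlen():

```php
<?php
// Numeric character reference for U+24C7, as it might arrive from a form.
$encoded = '&#9415;';

// Decode with an explicit UTF-8 target so the character can be represented.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

echo strlen($encoded), "\n";             // 7 (bytes before decoding)
echo mb_strlen($decoded, 'UTF-8'), "\n"; // 1 (character after decoding)
echo strlen($decoded), "\n";             // 3 (bytes in UTF-8)
```

Whether the decoded value fits a 4-character column then depends on whether the database counts characters or bytes for that column.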

what is the name of this type of encode?

If you copy and paste the following text in a html page,
&#1575;&#1606;&#1608;&#1575;&#1606;
you will see the following Arabic text:
انوان
My question is:
What is the name of this type of encoding that include numbers and hash (#) sign, and how decode it in PHP?

These are... HTML entities (or "Numeric character references" for the nitpickers).
Try html_entity_decode.
Example:
$foo = html_entity_decode('&#1575;&#1606;&#1608;&#1575;&#1606;');
// gives you the arabic words in $foo
(If the string is in the form &amp;#1575;... you need to apply html_entity_decode twice. (I don't know if codaddict's edit is valid.))
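
A sketch of both cases, with the charset passed explicitly so the result does not depend on the PHP version's default. The double-encoded sample is shortened to two characters for illustration:

```php
<?php
// Decimal character references for the five Arabic letters.
$entities = '&#1575;&#1606;&#1608;&#1575;&#1606;';
$text = html_entity_decode($entities, ENT_QUOTES, 'UTF-8');
echo $text, "\n"; // انوان

// Double-encoded input: the ampersands were themselves escaped as &amp;.
$double = '&amp;#1575;&amp;#1606;';
$once  = html_entity_decode($double, ENT_QUOTES, 'UTF-8'); // &#1575;&#1606;
$twice = html_entity_decode($once, ENT_QUOTES, 'UTF-8');   // ان
echo $once, "\n", $twice, "\n";
```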

These sequences are known as HTML entities. Basically, they're a safer way of representing characters such as & and other symbols that might have a special meaning in HTML. Any character can be written as a numeric reference; only some also have a named entity.
You can decode them in PHP by using html_entity_decode

You can use the convert_uudecode() function for decode.
<?php
echo convert_uudecode("+22!L;W9E(%!(4\"$`\n`"); //It prints I love PHP!
echo "\n";
echo convert_uudecode('&#1575;&#1606;&#1608;&#1575;&#1606;'); //It prints WU±
?>

To use proper terminology:
&amp; is an entity reference that references the entity named amp.
&#1575; is a character reference that references the character U+0627 (1575 in decimal) in the Unicode character set.
Both references are character references as they only reference single characters. But entities can also represent more than just a single character.
