I've been trying to post some data with CURL. The http_build_query may be the problem in my case.
I have a long list of information to post, like data mixed with data in arrays.
It looks like, when I call http_build_query, it returns a URL full of strange symbols, like this:
contact[person]=testes83¶ms[make]=ford¶ms[model]=focus¶ms[version]=mk1-1998-2004
Which, in my opinion, causes errors when the server tries to do something with it.
Also, I have a word that starts with "re"; after http_build_query, it's transformed to ®
'region_id' => 1,
Transforms into
'®ion_id' => 1,
Also here's the http_build_query that I'm using
http_build_query($car_info,'', '&');
I bet you put the string returned by http_build_query() directly into HTML, without properly encoding the HTML entities.
As per the HTML standard, there are four characters (<, >, & and ") that should always be properly encoded using their entities representation when they are used in HTML as normal characters:
< must be encoded as &lt;
> as &gt;
& as &amp;
" as &quot;.
An HTML character entity always starts with & and it should end with ;. The ending ; is optional and the browsers can successfully recognize character entities that don't end with ; when they are followed by some characters (white space characters, quotes, dot, comma, < etc).
But when the ending ; is missing, the browser is allowed to try to recognize an incomplete character entity or ignore it and consider & represents itself.
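The string produced by http_build_query() is:
contact[person]=testes83&params[make]=ford&region_id=1
When the browser parses it as HTML:
it interprets &para from &params as the entity &para; and renders ¶ (the & should have been written as &amp;);
it interprets &reg from &region as the entity &reg; and renders ® (the & should have been written as &amp;).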
The browser is right!
Your HTML is invalid and when this happens the browser is allowed to correct it as it pleases!
As @Álvaro González points out in a comment (thank you!), currently all major browsers recognize character entities when they don't end with ; and are followed by other letter characters (as happens in URLs).
You must always use htmlentities() or at least htmlspecialchars() to properly encode any string you build dynamically before throwing it as text in HTML. This includes the URLs, even when they are used as values for href or src HTML attributes.
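To illustrate the point above, here is a minimal sketch, assuming an array shaped like the one in the question (the exact contents of $car_info are a guess). The raw query string is what you would hand to cURL; only the copy you print into an HTML page gets escaped.

```php
<?php
// Hypothetical data, modelled on the fragments shown in the question.
$car_info = [
    'contact'   => ['person' => 'testes83'],
    'params'    => ['make' => 'ford'],
    'region_id' => 1,
];

// Raw query string: this is the value to send with cURL, unchanged.
$query = http_build_query($car_info, '', '&');

// Escaped copy: only for printing inside an HTML page (debug output, href, ...).
$forHtml = htmlspecialchars($query, ENT_QUOTES, 'UTF-8');

echo $forHtml, "\n";
```

In the escaped copy every & becomes &amp;, so the browser no longer sees &para or &reg and displays the string as it really is. Note that http_build_query() also percent-encodes the square brackets in the keys.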
Related
I need to make up a simple string in PHP which is a string of data to be posted to another site.
The problem is that one of the fields is 'notify_url=..' and when I use that PHP takes the & in front of it and the not part to mean the logical operator AND NOT and converts it to a ¬ character:
$string = 'field1=1234&field2=this&notify_url=http';
prints as 'field1=1234&field2=this¬ify_url=http'
The encoding on my page is UTF-8.
I have tried creating the string with single quotes as well as double quotes. I have tried making the field names variables and concatenating them in, but it always produces the special character.
This is not being urlencoded because the string is meant to be hashed before the form is submitted to verify posted data.
PHP isn't doing that, it's your browser interpreting HTML entity notation. & has a special meaning in HTML as the start of an HTML entity, and &not happens to be a valid HTML entity. You need to HTML-encode characters with special meanings:
echo htmlspecialchars($string);
// field1=1234&amp;field2=this&amp;notify_url=http
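Since the question mentions hashing the string, here is a rough sketch of keeping the two uses apart: hash the raw string, and escape only the copy that is printed. The choice of SHA-256 is purely an assumption for illustration; use whatever the receiving site requires.

```php
<?php
$string = 'field1=1234&field2=this&notify_url=http';

// Hash the raw, unescaped string (sha256 is an assumed algorithm).
$hash = hash('sha256', $string);

// Escape only the copy that is shown in the page.
$display = htmlspecialchars($string, ENT_QUOTES, 'UTF-8');

echo $display, "\n";
```

The escaping changes only what is sent to the browser; $string itself, and therefore the hash, is untouched.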
I was wondering if anybody knew how to get around this problem.
I am gathering user input from a HTML form which is then posted using htmlspecialchars into PHP to avoid issues when using quotes/etc...
However, I also want to run server-side validation checks on the data being gathered through regular expressions - though I'm not sure how to go about this.
So far, I have thought of decoding the htmlspecialchars output, but because I am going to be using the strings straight away, the code could break after I run this conversion. For example, let's say the user entered a double quote, ", into a field. This would be converted to &quot;. If I then decode this and use it in a variable, it could end up like $string = """; which is going to give me issues.
Any advice on this would be greatly appreciated!
You seem to misunderstand the difference between data and how this data is altered to be parseable in a certain context.
A php string can contain any data. What is stored in this string is the "raw" form: the form in which we want to manipulate the data if needed.
In certain contexts, not all characters are valid. For example, in an HTML textarea, the < and > characters may not be used, because they are special characters. We still want to be able to use these characters. To use special characters in a context, we escape them. By escaping a special character, it loses its special meaning. In the context of an HTML textarea, the < character is escaped as the sequence &lt;. Unlike the < character, this escaped sequence does not have a special meaning in HTML, and thus if we send the following sequence to the browser, it knows how to parse that sequence and display the right thing: <textarea>&lt;</textarea>. When we talk about what data this textarea contains, we do not say that it contains &lt;, but instead we say that it contains <.
As you said, in a php script, in a double quoted string, the " character has a special meaning. This has only to do with parsing. PHP simply does not know how to parse a sequence $str = """;. If we would want to have the double quote in such a double quoted string, we would need to escape it. We escape a double quote in a php double quoted string by prepending it with a \. To make a string containing a single double quote, using the double quoted notation, you would write $str = "\"";.
However, none of this matters here. You are taking input from an HTML form. When you click the submit button, the browser reads the value in the textarea (already decoded from its HTML form), encodes it in the way dictated by the form tag, and sends it to the server. The server then decodes that blob of text back into its raw data form. That data is passed to PHP, and it is this form you will encounter in $_POST['myTextarea'].
In conclusion: If data is encoded, realize for which context it was encoded and decode it based on that context. You do not need to escape for php quoted strings, because you are working on internal strings. There is nothing to parse. Remind yourself that when you are going to use the data somewhere, that you should take care that all special characters in your data for that particular context are escaped.
I suppose that the htmlspecialchars() function is called after the form is posted to PHP. The simplest solution is then to match against the regular expression first and call htmlspecialchars() afterwards.
Also, if you have a string encoded with htmlspecialchars(), after decoding it with htmlspecialchars_decode() the PHP internal representation will be "\"", so you break nothing. There is a big difference between how you write strings by hand in a PHP file and how PHP handles them internally. You really don't need to worry about this.
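A minimal sketch of the order suggested in the answers above: validate the raw value first, escape only when producing HTML. The function name and the allowed-character pattern are made up for illustration; adapt the pattern to your own rules.

```php
<?php
// Hypothetical helper: returns the HTML-safe value, or null if validation fails.
function validateAndEscape(string $raw): ?string
{
    // Validate the RAW data: letters, digits, space and a few punctuation
    // marks (including both quote characters), 1 to 50 characters long.
    if (!preg_match('/^[\p{L}\p{N} .,\'"-]{1,50}\z/u', $raw)) {
        return null;
    }
    // Escape only at the point where the value goes into HTML.
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

// In a real script $raw would come from $_POST['myTextarea'].
var_dump(validateAndEscape('Say "hi"'));
var_dump(validateAndEscape('<script>'));
```

A quote inside $raw is just another character in the string; it never needs escaping for PHP itself.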
I store codes like "\u1F603" within messages in my database, and now I need to display the corresponding emoji on my web page.
How can I convert \u1F603 to \xF0\x9F\x98\x83 using PHP for displaying emoji icons in a web page?
You don't need to convert emoji character codes to UTF-8 sequences, you can simply use the original 21-bit Unicode value as a numeric character reference in HTML like this: &#x1F603; which renders as: 😃.
The Wikipedia article "Unicode and HTML" explains:
In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a decimal number for the Unicode code point, or a hexadecimal number, in which case it must be prefixed by x. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.
For example, a Unicode code point like U+5408, which corresponds to a particular Chinese character, has to be converted to a decimal number, preceded by &# and followed by ;, like this: &#21512;, which produces this: 合.
So if in your PHP code you have a string containing '\u1F603', then you can create the corresponding HTML string using preg_replace, as in following example:
$text = "This is fun \\u1F603!"; // this has just one backslash, it had to be escaped
echo "Database has: $text<br>";
$html = preg_replace("/\\\\u([0-9A-F]{2,5})/i", "&#x$1;", $text);
echo "Browser shows: $html<br>";
This outputs:
Database has: This is fun \u1F603!
Browser shows: This is fun 😃!
Note that if in your data you would use the literal \u notation also for lower range Unicode characters, i.e. with hex numbers of 2 to 4 digits, you must make sure the next user's character is not also a hex digit, as it would lead to a wrong interpretation of where the \u escape sequence stops. In that case I would suggest to always left-pad these hex numbers with zeroes in your data so they are always 5 digits long.
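If you do want the actual UTF-8 bytes the question asks for (for example for a non-HTML context), something along these lines might work. It is only a sketch: it assumes the mbstring extension is available (mb_chr() needs PHP 7.2 or later), and the function name is made up.

```php
<?php
// Hypothetical helper: replaces literal \uXXXXX sequences with UTF-8 characters.
function unescapeCodePoints(string $text): string
{
    $result = preg_replace_callback(
        '/\\\\u([0-9A-F]{2,5})/i',
        function (array $m): string {
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            // Leave the sequence untouched if it is not a valid code point.
            return $char === false ? $m[0] : $char;
        },
        $text
    );
    return $result ?? $text; // preg_replace_callback returns null on error
}

$utf8 = unescapeCodePoints('This is fun \\u1F603!');
echo bin2hex($utf8), "\n";
```

The same caveat about 2 to 4 digit codes followed by a hex digit applies here, since the pattern is the one used above.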
To ensure your browser uses the correct character encoding, do the following:
Specify the UTF-8 character encoding in the HTML head section:
<meta charset="utf-8">
Save your PHP file in UTF-8 encoding. Depending on your editor, you may need to use a "Save As" option, or find such a setting in the editor's "Preferences" or "Options" menu.
Hello everyone,
after many tries I found a solution.
I used the code below:
https://github.com/BriquzStudio/php-emoji
include 'Emoji.php';
$message = Emoji::Decode($message);
This one is working fine for me! :)
I have a problem with a hidden character that shows up neither in the database (phpMyAdmin) nor on the website. The website uses UTF-8 character encoding. If I copy/paste the string with the "hidden" character into Notepad I can see it. It looks just like a bullet character but is hidden. What type of character is this and can it be removed with PHP?
The user who typed this character is on a Mac and is probably copying and pasting from a document (maybe Unicode?) into a form on our website and saving it. So this character is not visible with UTF-8 encoding but is visible if I copy my string into a Notepad document.
This is the hidden character at the end of the string. Looks like a bullet:
Copy the character, then fire up PowerShell and do the following (yes, it's convoluted, sorry):
'U+{0:X4}'-f+[char]'<PASTE>'
and paste the character where it says <PASTE>. It should give you the Unicode code point of that character. You should then be able to write something that removes it from the string, but as far as I can see there shouldn't be any input that destroys the document layout, except maybe fun things like RTL markers.
Short explanation of the above: [char]'x' converts a single-character string to a char, + will then treat it as a number (similar to [int], but shorter). The rest is a format string and the formatting operator -f.
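The same inspection can be done in PHP, which also answers the "can it be removed with PHP" part. This is a sketch under assumptions: it needs the mbstring extension and PHP 7.4 or later, the function names are made up, and since the actual character is unknown it simply strips all Unicode control (Cc) and format (Cf) characters. Be aware that Cc includes tab and newline, so narrow the pattern if you need to keep those.

```php
<?php
// Hypothetical helper: lists the code point of every character in a UTF-8 string.
function codePoints(string $s): array
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return array_map(fn($c) => sprintf('U+%04X', mb_ord($c, 'UTF-8')), $chars);
}

// Hypothetical helper: removes control (Cc) and format (Cf) characters.
function stripInvisible(string $s): string
{
    return preg_replace('/[\p{Cc}\p{Cf}]/u', '', $s) ?? $s;
}

$input = "abc\u{200B}"; // example only: a zero-width space at the end
print_r(codePoints($input));
var_dump(stripInvisible($input));
```

Once codePoints() tells you which character it really is, you can replace the broad character class with that single code point.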
curl downloads http://mysite.com/Lunacy%20Disc%202%20of%202%20(U)(Saturn).zip
but not
http://mysite.com/Lunacy Disc 2 of 2 (U)(Saturn).zip
Why is this the case?
Do I need to convert it to the first format ?
Using the URL generated via urlencode($url) fails.
Two problems:
urlencode will also encode the slashes on you. It's meant to encode query strings for use in urls, not full urls.
urlencode encodes spaces as +. You need rawurlencode if you want spaces as %20.
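A small sketch of what that means in practice: encode each path segment with rawurlencode() and leave the scheme, host and slashes alone. The helper name is made up, and splitting the URL into a fixed base and a path is an assumption to keep the example short.

```php
<?php
// Hypothetical helper: encodes every path segment but keeps the slashes.
function encodePath(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

$base = 'http://mysite.com/';
$file = 'Lunacy Disc 2 of 2 (U)(Saturn).zip';

$url = $base . encodePath($file);
echo $url, "\n";

// For contrast, urlencode() on the whole URL mangles the scheme and slashes:
echo urlencode('http://mysite.com/a b'), "\n";
```

rawurlencode() also turns the parentheses into %28 and %29, which is equivalent to the literal characters as far as the server is concerned.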
To convert a URL to the "first format", you can use the PHP function urlencode.
Now, for the "why", the answer can probably be found in the RFC 1738 - Uniform Resource Locators (URL).
Quoting some paragraphs :
Octets must be encoded if they have no corresponding graphic
character within the US-ASCII coded character set, if the use of the
corresponding character is unsafe, or if the corresponding character
is reserved for some other interpretation within the particular URL
scheme.
No corresponding graphic US-ASCII:
URLs are written only with the graphic printable characters of the
US-ASCII coded character set. The octets 80-FF hexadecimal are not
used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
control characters; these must be encoded.
A space has the hexadecimal code 20 (%20 when encoded). It's not in the range 00-1F, so this paragraph alone would not require encoding it... But, a bit later:
Unsafe:
Characters can be unsafe for a number of reasons. The space
character is unsafe because significant spaces may disappear and
insignificant spaces may be introduced when URLs are transcribed or
typeset or subjected to the treatment of word-processing programs.
And here, you know why the space character has to be escaped/encoded too ;-)
urlencode() does indeed fail with curl. If your problem is just with spaces, you can substitute them manually:
$url = str_replace(' ', '%20', $url);
You need to urlencode to translate the spaces (in your example; there are other characters that require it) for transmission across the internet. The encoding ensures that the various communications protocols don't terminate or otherwise mangle the string while they're handling it.
http://mysite.com/Lunacy Disc 2 of 2 (U)(Saturn).zip
That is not a valid URL. Accessing URLs like this may work in your browser because most modern browsers will automatically encode the URL for you if required. The curl library evidently does not do this automatically.
Why? Because some characters have special meanings, such as # (HTML anchor).
So all characters except alphanumeric ones are encoded, regardless of whether they need to be encoded or not.