Purpose of using esc_url

I don't understand why we need esc_url when I am the one who wrote the URL, for example:
echo get_template_directory_uri() . '/someText';
The /someText part is hardcoded, and I know it is clean and safe because I wrote it. Under what circumstances would this be unsafe? How would an attacker exploit it if I don't use esc_url here? Would they have to hack into the server? And if they can already hack into the server, esc_url wouldn't stop them anyway, would it?
I have read https://stackoverflow.com/a/30583251/19507498, but it only explains how to use the function, not why it is needed.

The purpose of this function is to replace spaces and special characters with their URL-encoded equivalents. For example, a space will be replaced with %20. This is needed because spaces and some other special characters, such as umlauts or ß, are not allowed in URLs.
EDIT:
Furthermore, ? and & need to be encoded when they are meant literally, because they have special meanings in URLs.
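To make the encoding concrete, here is a minimal sketch in plain PHP. It does not call esc_url itself (that function only exists inside WordPress); rawurlencode() and htmlspecialchars() are used as stand-ins, and the base URL and file name are made-up values.

```php
<?php
// Hypothetical stand-in for the value returned by get_template_directory_uri().
$base = 'https://example.com/wp-content/themes/my-theme';

// A made-up file name containing spaces and an umlaut (U+00E4).
$file = "my file \u{00E4}.css";

// Percent-encode the path segment so it is legal inside a URL.
$url = $base . '/' . rawurlencode($file);

// Escape for the HTML context before printing the URL into an attribute.
echo '<link rel="stylesheet" href="'
    . htmlspecialchars($url, ENT_QUOTES, 'UTF-8')
    . '">', PHP_EOL;
```

The two steps are separate on purpose: the first makes the URL valid, the second makes it safe to embed in HTML. As far as I know, esc_url covers parts of both in one call, but its exact rules differ from this sketch.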

Related

Strip Base64 strings from long text

I really wonder whether I'm the first one asking this question, or whether I'm just too blind to find anything about it...
I have a longer text and I want to strip base64-encoded strings from it:
I am a text and have some lines with some content
There are more than one line but sometimes I have
aSBhbSBhIG5vcm1hbCB0ZXh0IHRoYXQgd2FzIGNvZ
GVkIGluIGJhc2UgNjQgYW5kIG5vdyBpIHdhcyB0cmFu
c2xhdGVkIGJhY2sgdG8gYmxhbmsgdGV4dGZvcm1hd
C4gaSB0aGFuayB5b3UgZm9yIHBheWluZyBhdHRlbnRp
b24uIGJ5ZQ==
and this is what I want to strip / extract by using php
As you can see, there is base64-encoded data in the text, and I want to extract/strip these lines.
I already tried a lot of regex samples from SO, something like:
$regex = '#^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$#m';
preg_match($regex, $content, $output_array );
but this did not solve anything...
What I need is a regex that selects only the base64 strings...
Is this even possible? I mean, is base64 selectable by regex? I guess so :)
EDIT: The string source is the content of an email.
EDIT2: I guess the best approach for this case would be to track strings that have more than one uppercase character, may contain numbers, and have no whitespace. But regex is not my daily bread :D

First of all: You can not reliably do this!
Why?
Simple: the reason base64 is so useful in some cases is that it encodes all data with "standard" characters, the same ones used in normal texts, sentences, and yes, even words.
Background
Is "Hello" a base64-encoded string? Well, yes, in the sense that it consists only of valid base64 characters. Decoding it probably returns a lot of gibberish, but it is a base64-ok string.
Therefore, you can only pick a length beyond which you consider an unbroken run of characters to be base64-encoded. Of course, in languages such as German you may run into trouble here, as there are compound nouns such as "Bäckerfachverkäuferinnenhosenherstellungsautomatenzuliefererdienst" (just made that up).
Workaround
So you have to decide on the length yourself, and then you can go with this:
[a-zA-Z0-9\+\/\=]{20,}
Also see the example here: https://regex101.com/r/uK5gM1/1
I considered 20 to be the minimum length for "base64-encoded stuff" here, but as said, it is up to you. Also, as a small side note, the = is not really encoded content but padding; I still added it to the regex.
Edit: Gnah... you can even see in my example that I did not catch the last line :) When changing the number to 12 it works fine here, but there may be words with more than 12 characters... so, as said, this is not really reliably possible in this manner.
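As a rough illustration of the length-threshold workaround, the following sketch runs the character class against the sample text from the question. The threshold of 12 and the variable names are assumptions chosen for this sample only; a real email will contain ordinary words that are longer than that.

```php
<?php
// Sample text from the question, joined with newlines.
$content = implode("\n", [
    'I am a text and have some lines with some content',
    'There are more than one line but sometimes I have',
    'aSBhbSBhIG5vcm1hbCB0ZXh0IHRoYXQgd2FzIGNvZ',
    'GVkIGluIGJhc2UgNjQgYW5kIG5vdyBpIHdhcyB0cmFu',
    'c2xhdGVkIGJhY2sgdG8gYmxhbmsgdGV4dGZvcm1hd',
    'C4gaSB0aGFuayB5b3UgZm9yIHBheWluZyBhdHRlbnRp',
    'b24uIGJ5ZQ==',
    'and this is what I want to strip / extract by using php',
]);

// Assumed threshold: shorter runs are treated as normal words.
$minLength = 12;
$regex = '#[A-Za-z0-9+/=]{' . $minLength . ',}#';

// Extract the candidate runs ...
preg_match_all($regex, $content, $matches);
$candidates = $matches[0];

// ... or strip them from the text instead.
$stripped = preg_replace($regex, '', $content);

print_r($candidates);
```

Any sufficiently long word made of letters and digits would be caught as well, which is exactly the unreliability described above.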

For the snippet in the example, /^\w{53}$/gm does the job, provided you can rely on the length, of course.
EDIT:
Considering the circumstances and updates, I would go with /\n([\w=\n]{50,})\n/gs, but without metadata it may be tricky to guess the MIME type of the decoded content, and almost impossible to restore filenames etc.

Changing case with regex

I have been looking for this for a while but was not able to find an answer. I need to change a string to lowercase in PHP.
Of course, this can be done with strtolower(), but I was wondering whether it is possible to do it via preg_replace().
I noticed that in vim one can use the \L or \U modifiers in back references to change the case to lower or upper.
Is something like that possible in PHP, i.e. in the second argument of preg_replace()? The reason I want to change the case via preg_replace() is that I heard it might work better for UTF-8 strings (not sure if that's true).
Thanks.

You should actually just use
mb_strtolower($str, 'UTF-8')
That way you specify UTF-8 as the encoding, and all should work well.
Edit: sorry, I had strtoupper; changed to lower. Also, you can leave off 'UTF-8', in which case the internal encoding (see mb_internal_encoding()) is used. It is not auto-detected, so this only works if that setting matches your data.

Doing this with preg_replace alone is practically impossible.
This is because the replacement argument of preg_replace is a plain string: it cannot call strtolower() / strtoupper() on the match, and PHP has no equivalent of vim's \L or \U.
Go with the function Dave suggested.
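For comparison, here is a small sketch of both routes. The first is the mb_strtolower() call suggested above; the second uses preg_replace_callback(), a different function from the preg_replace() asked about, which hands each match to a callback. The input string is a made-up example and the mbstring extension is assumed to be available.

```php
<?php
// Made-up input with ASCII and non-ASCII uppercase letters (Ä, Ö, Ü).
$input = "Hello WORLD, \u{00C4}\u{00D6}\u{00DC}";

// Route 1: lowercase the whole string with an explicit encoding.
$lower = mb_strtolower($input, 'UTF-8');

// Route 2: lowercase only the runs of uppercase letters found by a
// Unicode-aware regex; the callback does the actual case change.
$viaRegex = preg_replace_callback(
    '/\p{Lu}+/u',
    function (array $m): string {
        return mb_strtolower($m[0], 'UTF-8');
    },
    $input
);

echo $lower, PHP_EOL, $viaRegex, PHP_EOL;
```

The regex route still relies on mb_strtolower() for the conversion, so it only pays off when just some matches should change case.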

unknown characters %252B in url

I have a page with links taken from an RSS feed. They are:
broken link
http://news.asiaone.com/News/Latest%252BNews/Singapore/Story/A1Story20121220-390687.html
working link
http://news.asiaone.com/News/Latest%2BNews/Singapore/Story/A1Story20121220-390687.html
I realised it works if I change %252B to %2B. I'm using PHP. Is there a way to detect and correct this on the fly?

The URL has been double encoded. %25 is the escape sequence for "%", so a regular %2B got escaped again to %252B.
urldecode the value, but better avoid double-encoding it to begin with if possible.

Use "urldecode"
echo urldecode("http://news.asiaone.com/News/Latest%252BNews/Singapore/Story/A1Story20121220-390687.html");
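Note that urldecode() on the whole URL decodes every escape sequence and also turns + into a space. If only the extra layer should be removed, something like the following sketch might do; fixDoubleEncoding is a hypothetical helper name, and it assumes that any %25 followed by two hex digits is a double-encoded escape rather than an intentional literal.

```php
<?php
/**
 * Hypothetical helper: undo one level of double percent-encoding by
 * turning "%25XX" (an escaped "%" followed by two hex digits) into "%XX".
 */
function fixDoubleEncoding(string $url): string
{
    return preg_replace('/%25([0-9A-Fa-f]{2})/', '%$1', $url);
}

$broken = 'http://news.asiaone.com/News/Latest%252BNews/Singapore/Story/A1Story20121220-390687.html';
$fixed = fixDoubleEncoding($broken);

echo $fixed, PHP_EOL;
```

URLs that were encoded only once pass through unchanged, so the helper can be applied to every link from the feed.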

Encoding of $_SERVER variables like $_SERVER['REQUEST_URI']

I need to do some string operations on $_SERVER values, especially $_SERVER['REQUEST_URI'].
How does PHP encode such strings? Should I use the mb_* family of functions?
To make my question clearer, let's say I have a page on my web server named like this:
ããã.php
And I need to get the second char:
echo mb_substr($_SERVER['REQUEST_URI'],1,1);

How does PHP encode such strings?
PHP won't do anything to the string, but the web browser will usually percent-encode any non-ASCII characters in REQUEST_URI. (I say "usually" because I have seen IE not do it. I expect, however, Apache to do the job in that case, but I'm not entirely sure whether it will. You'd have to try it out.)
Running urldecode() will decode those characters.
Related reading: Unicode characters in URLs
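A minimal sketch of the decode-then-slice approach follows. The request URI is a hypothetical value showing how a browser would typically send /ããã.php, assuming UTF-8 percent-encoding; rawurldecode() is used here instead of urldecode() so that a literal + in the path is left alone.

```php
<?php
// Hypothetical value of $_SERVER['REQUEST_URI'] for the page "/ããã.php",
// assuming the browser percent-encoded it as UTF-8 (ã is C3 A3).
$requestUri = '/%C3%A3%C3%A3%C3%A3.php';

// Decode the percent escapes back into raw UTF-8 bytes.
$decoded = rawurldecode($requestUri);

// Index 0 is the leading slash, so index 1 is the first "ã".
$second = mb_substr($decoded, 1, 1, 'UTF-8');

echo $second, PHP_EOL;
```

Without the decoding step, mb_substr() would return the "%" of the first escape sequence, since the raw REQUEST_URI is plain ASCII.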

$_SERVER['REQUEST_URI'] usually doesn't contain anything that would require the multibyte functions. So unless you do unusual things with request URIs, such as stuffing them with non-ASCII characters (caused by weird URI rewrites etc.), you are safe to use the "normal" PHP string functions.

Correct character encoding

I'm currently scraping a website for various pieces of textual data (with permission, of course). The issue I'm seeing is that certain characters aren't correctly encoded in the process. This is particularly prominent with apostrophes ('): leading to characters such as: .
Currently, I use the following code to convert various HTML entities from the scraped data:
htmlentities($content, ENT_COMPAT, 'UTF-8', FALSE)
Is there a better way to handle this sort of thing?

HTML entities have two goals:
Escape characters that have a special meaning in HTML, such as angle brackets, so they can be used as literals.
Display characters that are not supported by the character set you are using, such as the euro symbol in an ISO-8859-1 document.
They are not exactly an encoding tool.
If you want to convert from one charset into another one, I suggest you use iconv(). However, you must know both the source and the target charset. The source charset should be mentioned in the Content-Type response header and the target charset is something you decided when you started the site (although in your case it looks like UTF-8 is the most reasonable option).
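A charset conversion with iconv() could look like the sketch below. The sample bytes and the ISO-8859-1 source charset are assumptions for illustration; in practice the source charset should come from the Content-Type header of the scraped page.

```php
<?php
// Hypothetical scraped bytes: "café" in ISO-8859-1 (é is the single byte 0xE9).
$scraped = "caf\xE9";

// Convert from the (assumed) source charset to the site's target charset.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $scraped);

// Show the resulting bytes; é is now the two-byte sequence C3 A9.
echo bin2hex($utf8), PHP_EOL;
```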

You don't want to use htmlentities right away; I would use it on the data at the last point before you store it. One of the problems you'll run into is that people don't always encode their entities properly anyway. Not everyone uses &trade;; they just paste the trademark symbol in. If you put some logic in to grab whatever they put in and encode it properly, you may be better off. For example:
// Each pattern is paired with the replacement at the same position.
$patterns = array();
$patterns[0] = '/—/';
$patterns[1] = '/&nbsp;/';
$patterns[2] = '/®/';
$replacements = array();
$replacements[0] = '&#8212;'; // em dash
$replacements[1] = '&#160;';  // non-breaking space
$replacements[2] = '&#174;';  // registered sign
$ourhtml = preg_replace($patterns, $replacements, $html);
You could find all the "gotcha" characters like dashes, single quotes, apostrophes, etc. and encode them by hand, and settle on one standard for the entities (named or numeric).
You could also use regular expressions to do the same thing, and would probably be a more elegant solution. But my suggestion would be to take some time filtering out what you don't want by hand, and then you know your data will be prepared exactly how you like.

It's a little bit difficult to suggest things based on the information provided. Can you provide an example snippet of text maybe?
Failing that, I'll employ the shotgun approach (i.e., suggesting a bunch of things and hoping one of them hits).
First of all, are you sure the page you're accessing is encoded in UTF-8? What does mb_detect_encoding say?
One option (may not work depending on your needs) would be to use iconv with the TRANSLIT option to convert the characters into something easier to handle using PHP. You could also look at using the mb_* functions for working with multibyte strings.
Are you sure htmlentities is the problem? If the content is UTF-8, and your site is set to serve ISO-8859-1, you're going to see odd characters. Check the encoding your browser is using to make sure it matches the encoding of the characters you're producing.
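One way to act on the first two suggestions is to check whether the scraped bytes are valid UTF-8 and convert them only if they are not. This is a sketch under assumptions: the sample bytes are made up, and Windows-1252 is merely a guess at the source charset, a common one for pages where apostrophes turn into odd characters. The mbstring extension is assumed to be available.

```php
<?php
// Hypothetical scraped bytes: "It’s" with a Windows-1252 curly apostrophe (0x92).
$scraped = "It\x92s";

// A lone 0x92 byte is not valid UTF-8, so this check fails for the sample.
if (!mb_check_encoding($scraped, 'UTF-8')) {
    // Assumed source charset; adjust to what the page actually declares.
    $scraped = mb_convert_encoding($scraped, 'UTF-8', 'Windows-1252');
}

echo $scraped, PHP_EOL;
```

A validity check like this cannot prove which charset the data came from; it only tells you that the data is not UTF-8 yet.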

I don't see any issue with using htmlentities() as long as you pass false as the last parameter. This will ensure that you don't encode anything twice (such as turning &amp; into &amp;amp;).
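The effect of that last parameter (double_encode) can be seen in a short sketch; the input string is a made-up example that mixes an already-encoded entity with a bare ampersand.

```php
<?php
// Made-up input: one entity that is already encoded, one bare ampersand.
$content = 'Tom &amp; Jerry & friends';

// double_encode = false: existing entities are left alone.
$once = htmlentities($content, ENT_COMPAT, 'UTF-8', false);

// double_encode = true (the default): the existing entity is encoded again.
$twice = htmlentities($content, ENT_COMPAT, 'UTF-8', true);

echo $once, PHP_EOL, $twice, PHP_EOL;
```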
