I'd like to somehow obscure the contents of $url = "http://blah.somedomain.com/contents/somefolder/somefile.htm"; so I can use them for links but so that the URLs are not easily read by humans when looking at the page source. The obfuscated URL still needs to work in a browser when clicking on it though so other methods of obfuscation that I've looked at are no good.
What we're after is e.g. $obscureurl = "%3A%2F%2F"...etc
Any ideas? Thanks.
Edit: Thanks for suggestions so far, but to clarify, I should have said that I'm not after encoding into HTML entities (the # values), I'm after Percent-encoding (hex values in ASCII).
For example, to change hello@me.com into: %68%65%6c%6c%6f%40%6d%65%2e%63%6f%6d
ASCII table is here for the hex of each letter and symbol: http://ascii.cl/
Is this kind of complete conversion possible with PHP? Thanks
$url = '..';
$encoded = join(array_map(function ($byte) { return "%$byte"; }, str_split(bin2hex($url), 2)));
That's essentially the entire encoding mechanism. Take the raw bytes in hex (bin2hex), 2 characters per byte, and prepend a %.
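As a minimal sketch, the same mechanism can be wrapped in a function. The name percent_encode_all is hypothetical, and rawurldecode() is used here only to show that the result decodes back to the input. One caveat, stated as an assumption about your use case: if the scheme and host are encoded too, a browser will most likely treat the result as a relative path, so you would probably encode only the part after the host.

```php
<?php
// Hypothetical helper: percent-encode every byte of a string.
function percent_encode_all(string $s): string
{
    if ($s === '') {
        return '';
    }
    // bin2hex() yields two hex digits per byte; prefix each pair with '%'.
    return '%' . implode('%', str_split(bin2hex($s), 2));
}

$encoded = percent_encode_all('hello@me.com');
echo $encoded, "\n";               // the fully percent-encoded form
echo rawurldecode($encoded), "\n"; // decodes back to the original string
```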
Not that this will really do a whole lot for obfuscation. The browser may indeed not even display it in its encoded form, and even search engines may display only the decoded form. Further, you're still producing a canonical URL. It doesn't matter what exactly that URL contains; if people have a link to it, they have a link to it, regardless of how human readable that link may or may not be.
I can see 2 easy ways to achieve this:
Replace every character of your link by its html entity (see How to convert all characters to their html entity equivalent using PHP)
Use some kind of ids and save the matching url in your DB: (something like http://example.com/redirect/412)
Generally, I would strip all characters that are not English using something like :
$file = filter_var($file, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH );
however, I am tired of not providing support for user input from other languages which may be in the form of an uploaded file (the filename may be in Cyrillic or Chinese, or Arabic, etc) or a form field, or even content from a WYSIWYG.
The examples for sanitizing data with regards to this, come in one of two forms
Those that strip all chars which are non-English
Those that convert all chars which are non-English to English letter substitutes.
The problem with this practice is that you end up with a broken framework that pretends to support multiple languages but really doesn't, aside from maybe displaying labels or content to users in their language.
There are a number of attacks which take advantage of unicode/utf-8/utf-16/etc support passing null bytes and so on, so it is clear that not sanitizing the data is not an option.
Is there any way to clean up a variable from arbitrary commands while maintaining the full alphabets/chars of these other languages, but stripping out (in a generic manner) all possible non-printable chars, chars that have nulls in them as part of the char, and other such exploits while maintaining the integrity of the actual characters the user input ? The above command is perfect and does everything exactly as it should, however it would be super cool if there were a way to expand that to allow support for all languages.
In UTF-8, a zero byte never occurs inside a multi-byte character (unlike in UTF-16), so assuming you use UTF-8 internally, the main thing you need to do is to verify that the passed variables are valid UTF-8. Be aware that a lone NUL (U+0000) is itself well-formed UTF-8, so reject control characters separately if they matter to you. There's no need to support UTF-16, for example, because you as author of the according API or form define the correct encoding and you can limit yourself to UTF-8. Further, "unicode" is also not an encoding you need to support, simply because it is not an encoding. Rather, Unicode is a standard and the UTF encodings are part of it.
Now, back to PHP, the function you are looking for is mb_check_encoding(). Error handling is simple, if any parameter doesn't pass that test, you reply with a "bad request" response. No need to try to guess what the user might have wanted.
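A minimal sketch of that check, assuming the mbstring extension is loaded. The helper name is_clean_utf8 and the form field name in the usage comment are made up for illustration, and the extra rejection of control characters (including NUL) is an assumption on top of what mb_check_encoding() itself does.

```php
<?php
// Hypothetical helper: true if $value is well-formed UTF-8
// and contains no control characters other than tab, LF and CR.
function is_clean_utf8(string $value): bool
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        return false;
    }
    // In valid UTF-8, bytes below 0x80 only ever appear as ASCII characters,
    // so a byte-wise match for C0 controls and DEL is safe here.
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', $value) === 0;
}

// Possible use in a request handler (field name 'title' is an assumption):
// if (!is_clean_utf8($_POST['title'] ?? '')) {
//     http_response_code(400);
//     exit;
// }
```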
While the question doesn't specifically ask this, here are some examples and how they should be handled on input:
non-UTF-8 bytes: Reject with 400 ("bad request").
strings containing path elements (like ../): Accept.
filename (not file path) containing path elements (like ../): Reject with 400.
filenames شعار.jpg, 标志.png or логотип.png: Accept.
filename foo <0> bar.jpg: Accept.
number abc: Reject with 400.
number 1234: Accept.
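The filename and number rules in this list could be sketched roughly as follows. Both function names are hypothetical, "number" is assumed to mean a non-negative decimal integer, and the UTF-8 check is assumed to have happened already.

```php
<?php
// Hypothetical validator: a filename, not a path.
function is_plain_filename(string $name): bool
{
    if ($name === '' || $name === '.' || $name === '..') {
        return false;
    }
    // Reject directory separators (both kinds) and NUL bytes.
    return preg_match('#[/\\\\\x00]#', $name) === 0;
}

// Hypothetical validator: digits only (assumes "number" means a non-negative integer).
function is_plain_number(string $value): bool
{
    return ctype_digit($value);
}

var_dump(is_plain_filename('标志.png'));       // accepted
var_dump(is_plain_filename('../etc/passwd')); // rejected, contains path elements
var_dump(is_plain_number('abc'));             // rejected
```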
Here's how to handle them for different outputs:
non-UTF-8 bytes: Can't happen, they were rejected before.
filename containing path elements: Can't happen, they were rejected before.
filenames شعار.jpg, 标志.png or логотип.png in HTML: Use verbatim if the HTML encoding is UTF-8, replace as HTML entities when using default ISO8859-1.
filenames شعار.jpg, 标志.png or логотип.png in Bash: Use verbatim, assuming the filesystem's encoding is UTF-8.
filenames شعار.jpg, 标志.png or логотип.png in SQL: Probably just quote, depends on the driver, DB, tables etc. Consult the manual.
filename foo <0> bar.jpg in HTML: Escape as "foo &lt;0&gt; bar.jpg". Maybe use "&nbsp;" for the spaces.
filename foo <0> bar.jpg in Bash: Quote or escape " ", "<" and ">" with backslashes.
filename foo <0> bar.jpg in SQL: Just quote.
number abc: Can't happen, they were rejected before.
number 1234 in HTML: Use verbatim.
number 1234 in Bash: Use verbatim (not sure).
number 1234 in SQL: Use verbatim.
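To illustrate the idea that the same value is escaped differently per output context, here is a small sketch using PHP's built-in htmlspecialchars() and escapeshellarg(). The surrounding markup and the ls command are made up for the example, and SQL is left out because it depends on the driver (a prepared statement would be the usual route).

```php
<?php
$filename = 'foo <0> bar.jpg';

// HTML context: special characters become entities.
$html = htmlspecialchars($filename, ENT_QUOTES, 'UTF-8');
echo '<li>' . $html . "</li>\n";

// Shell context: the whole value is quoted as a single argument.
$shell = escapeshellarg($filename);
echo 'ls -l -- ' . $shell . "\n";
```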
The general procedure should be:
Define your internal types (string, filename, number) and reject anything that doesn't match. These types create constraints (filename doesn't include path elements) and offer guarantees (filename can be appended to a directory to form a filename inside that directory).
Use a template library (Mustache comes to mind) for HTML.
Use a DB wrapper library (PDO, Propel, Doctrine) for SQL.
Escape shell parameters. I'm not sure which way to go here, but I'm sure you will find proper ways.
Escaping is not a single defined procedure but a family of procedures. The actual escaping algorithm used depends on the target context. Contrary to what you wrote ("escaping will also screw up the names"), the exact opposite should be the case! Basically, it makes sure that a string containing a less-than sign in XML remains a string containing a less-than sign and doesn't turn into a malformed XML snippet. In order to achieve that, escaping converts strings so that any character that would normally be interpreted as something other than plain text, like the space character in the shell, loses that special interpretation.
I really wonder whether I'm the first one asking this question, or whether I'm just too blind to find anything about it...
I have a longer text and I want to strip base64 encoded strings from it
I am a text and have some lines with some content
There are more than one line but sometimes I have
aSBhbSBhIG5vcm1hbCB0ZXh0IHRoYXQgd2FzIGNvZ
GVkIGluIGJhc2UgNjQgYW5kIG5vdyBpIHdhcyB0cmFu
c2xhdGVkIGJhY2sgdG8gYmxhbmsgdGV4dGZvcm1hd
C4gaSB0aGFuayB5b3UgZm9yIHBheWluZyBhdHRlbnRp
b24uIGJ5ZQ==
and this is what I want to strip / extract by using php
As you can see there is base64 encoded data in the text and I want to extract/strip these lines.
I already tried a lot of regex samples from SO, something like
$regex = '#^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$#m';
preg_match($regex, $content, $output_array );
but this did not solve anything...
What I need is a regex that only selects the base64 strings...
Is this even possible ? I mean is base64 selectable by regex ? I guess :)
EDIT: String-Source is the content of an email
EDIT2: I guess the best approach for this case would be to track strings that have more than one uppercase character, can contain numbers, and have no whitespace. But regex is not my daily bread :D
First of all: You can not reliably do this!
Why?
Simple: the reason base64 is so great in some cases is that it encodes all the data with "standard" characters. Those that are used in normal texts, sentences, and yes, even words.
Background
Is "Hello" a base64-encoded string? Well, yes, in the sense that it is "valid base64 encoded". Decoding it probably returns a lot of gibberish, but it is a base64-ok string.
Therefore, you can only decide on a length after which you consider characters connected without any space to be base64 encoded. Of course, in languages such as German you may have quite some trouble here, as there are compound nouns such as "Bäckerfachverkäuferinnenhosenherstellungsautomatenzuliefererdienst" (just made that up).
Workaround
So you have to decide on the length yourself, and then you can go with this:
[a-zA-Z0-9\+\/\=]{20,}
Also see the example here: https://regex101.com/r/uK5gM1/1
I considered "20" to be the minimum length for "base64 encoded stuff" here, but as said, it is up to you. Also, as a small side note, the = is not really encoded content but fill bytes, but I still added it to the regex.
Edit: Gnah.. you can even see in my example that I did not catch the last line :) When changing the number to 12 it works fine here, but there may be words with more than 12 characters ... so - as said, not really reliably possible in this manner.
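One possible refinement, sketched under assumptions rather than offered as a reliable solution: instead of matching each line by length, collect runs of consecutive lines that contain only base64 characters, join them, and keep the run only if base64_decode() in strict mode accepts it. The function name extract_base64_blocks and the default threshold of 20 are assumptions, and a short ordinary line without spaces directly next to the block would still be swallowed into it.

```php
<?php
// Hypothetical helper: return the decoded content of every multi-line base64 block.
function extract_base64_blocks(string $text, int $minLength = 20): array
{
    // Normalise line endings so that "$" in multiline mode behaves predictably.
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // One or more consecutive lines made up solely of base64 characters.
    $pattern = '#^[A-Za-z0-9+/=]+(?:\n[A-Za-z0-9+/=]+)*$#m';
    preg_match_all($pattern, $text, $matches);

    $decoded = [];
    foreach ($matches[0] as $candidate) {
        $joined = str_replace("\n", '', $candidate);
        // Too short, or not a whole number of 4-character groups: skip it.
        if (strlen($joined) < $minLength || strlen($joined) % 4 !== 0) {
            continue;
        }
        $bytes = base64_decode($joined, true); // strict mode
        if ($bytes !== false) {
            $decoded[] = $bytes;
        }
    }
    return $decoded;
}
```

The line-spanning match matters for input like the question's sample, where the individual lines are not multiples of four characters long and only the joined block is valid base64.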
For the snippet in the example /^\w{53}$/gm does the job. If you can rely on length of course.
EDIT:
Considering circumstances and updates, I would go with /\n([\w=\n]{50,})\n/gs but without metadata it may be tricky to guess mime-type of the decoded stuff, and almost impossible to restore filenames etc.
I'm currently scraping a website for various pieces of textual data (with permission, of course). The issue I'm seeing is that certain characters aren't correctly encoded in the process. This is particularly prominent with apostrophes ('): leading to characters such as: .
Currently, I use the following code to convert various HTML entities from the scraped data:
htmlentities($content, ENT_COMPAT, 'UTF-8', FALSE)
Is there a better way to handle this sort of thing?
HTML entities have two goals:
Escape characters that have a special meaning in HTML, such as angle brackets, so they can be used as literals.
Display characters that are not supported by the character set you are using, such as the euro symbol in an ISO-8859-1 document.
They are not exactly an encoding tool.
If you want to convert from one charset into another one, I suggest you use iconv(). However, you must know both the source and the target charset. The source charset should be mentioned in the Content-Type response header and the target charset is something you decided when you started the site (although in your case it looks like UTF-8 is the most reasonable option).
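A minimal sketch of such a conversion, assuming the source turns out to be Windows-1252 (a common cause of broken apostrophes, but only a guess for your site). The wrapper name to_utf8 is hypothetical, and the charset names accepted by iconv() depend on the underlying iconv implementation.

```php
<?php
// Hypothetical wrapper around iconv() that fails loudly instead of returning false.
function to_utf8(string $bytes, string $sourceCharset): string
{
    $converted = iconv($sourceCharset, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $sourceCharset to UTF-8");
    }
    return $converted;
}

// 0x92 is the right single quotation mark in Windows-1252.
$scraped = "It\x92s";
echo to_utf8($scraped, 'Windows-1252'), "\n";
```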
You don't want to use htmlentities right away; I would use that on the data at the last point before you store it. One of the problems you'll run into is that people don't always encode their entities properly anyway. Not everyone uses &trade;; they just paste the trademark symbol in. If you put some logic in to try and grab whatever they put in and encode it properly, you may be better off. For example:
// $html is assumed to hold the scraped markup as UTF-8.
$patterns = array();
$patterns[0] = '/—/u';
$patterns[1] = '/&nbsp;/';
$patterns[2] = '/®/u';
$replacements = array();
$replacements[0] = '&#8212;'; // em dash
$replacements[1] = '&#160;';  // non-breaking space
$replacements[2] = '&#174;';  // registered sign
$ourhtml = preg_replace($patterns, $replacements, $html);
You could find all the "gotcha" characters like dashes and single quotes, apostrophes etc and encode them by hand, as well as use a set standard to the entities (text or numeric).
You could also use regular expressions to do the same thing, and would probably be a more elegant solution. But my suggestion would be to take some time filtering out what you don't want by hand, and then you know your data will be prepared exactly how you like.
It's a little bit difficult to suggest things based on the information provided. Can you provide an example snippet of text maybe?
Failing that, I'll employ the shotgun approach (i.e., suggesting a bunch of things and hoping one of them hits).
First of all, are you sure the page you're accessing is encoded in UTF-8? What does mb_detect_encoding say?
One option (may not work depending on your needs) would be to use iconv with the TRANSLIT option to convert the characters into something easier to handle using PHP. You could also look at using the mb_* functions for working with multibyte strings.
Are you sure htmlentities is the problem? If the content is UTF-8, and your site is set to serve ISO-8859-1, you're going to see odd characters. Check the encoding your browser is using to make sure it matches the encoding of the characters you're producing.
I don't see any issue with using htmlentities() as long as you pass false as the last parameter. This will ensure that you don't encode anything twice (such as turning &amp; into &amp;amp;).
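A short illustration of what the last parameter (double_encode) changes; the sample string is made up:

```php
<?php
$content = 'Tom &amp; Jerry & friends';

// double_encode = false: existing entities are left alone, bare "&" is still encoded.
$once = htmlentities($content, ENT_COMPAT, 'UTF-8', false);

// double_encode = true (the default): the "&" of "&amp;" is encoded again.
$twice = htmlentities($content, ENT_COMPAT, 'UTF-8', true);

echo $once, "\n";  // Tom &amp; Jerry &amp; friends
echo $twice, "\n"; // Tom &amp;amp; Jerry &amp; friends
```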
How to safely encode PHP string into alphanumeric only string?
E.g. "Hey123 & 5" could become "ed9e0333" or may be something better looking
It's not about stripping characters, its about encoding.
The goal is to make any string after this encoding suitable for css id string (alnum),
but later I will need to decode it back and get the original string.
bin2hex seems to fit the bill (although not as compact as some other encodings). Also take care that CSS ids cannot start with a number, so to be sure you'll need to prefix something to the bin2hex result before you have your final ID.
For the reverse (decoding), hex2bin() is available as of PHP 5.4. On older versions, someone on the PHP documentation site suggested this (untested):
$bin_str = pack("H*" , $hex_str);
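Put together as a round trip, a sketch could look like this. The function names and the "id" prefix are assumptions; the prefix is only there so the result never starts with a digit.

```php
<?php
const ID_PREFIX = 'id'; // assumed prefix so the ID starts with a letter

// Hypothetical helper: any string -> alphanumeric CSS id.
function css_id_encode(string $s): string
{
    return ID_PREFIX . bin2hex($s);
}

// Hypothetical helper: CSS id -> original string.
function css_id_decode(string $id): string
{
    return pack('H*', substr($id, strlen(ID_PREFIX)));
}

$id = css_id_encode('Hey123 & 5');
echo $id, "\n";                // id48657931323320262035
echo css_id_decode($id), "\n"; // Hey123 & 5
```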
You can use BASE64 encoding
http://php.net/manual/function.base64-encode.php
This thread has been dead for a long time, but I was looking for a solution to this problem and found it, so someone might find this easy answer useful.
My solution is:
str_replace('=', '_', base64_encode($data));
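Note that plain base64 output can also contain + and /, which the replacement above leaves in place. A possible variant, sketched here with hypothetical function names and an assumed "b_" prefix, swaps those for - and _ and drops the padding. The result is usable as a CSS id but is not strictly alphanumeric, since it may contain - and _.

```php
<?php
// Hypothetical helper: string -> id made of [A-Za-z0-9_-], starting with a letter.
function b64_id_encode(string $data): string
{
    return 'b_' . rtrim(strtr(base64_encode($data), '+/', '-_'), '=');
}

// Hypothetical helper: reverse of b64_id_encode().
function b64_id_decode(string $id): string
{
    $b64 = strtr(substr($id, 2), '-_', '+/');
    // Restore the padding that was stripped during encoding.
    $b64 .= str_repeat('=', (4 - strlen($b64) % 4) % 4);
    return base64_decode($b64, true);
}

$id = b64_id_encode('Hey123 & 5');
echo $id, "\n";
echo b64_id_decode($id), "\n";
```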
I have a form served in non-UTF-8 (it’s actually in Windows-1251). People, of course, post there any characters they like to. The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities so I can still recognise them. For example, if user types an →, I receive an &#8594;. That’s partially great, like, if I just echo it back, the browser will correctly display the → no matter what.
The problem is, I actually do a htmlspecialchars () on the text before displaying it (it’s a PHP function to convert special characters to HTML entities, e.g. & becomes &amp;). My users sometimes type things like &mdash; or &copy;, and I want to display them as actual &mdash; or &copy;, not — and ©.
There’s no way for me to distinguish an &#8594; from →, because I get them both as &#8594;. And, since I htmlspecialchars () the text, and I also get a &#8594; for a → from browser, I echo back an &amp;#8594; which gets displayed as &#8594; in a browser. So the user’s input gets corrupted.
Is there a way to say: “Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself”?
Oh, I know that the good idea is to switch the whole software to UTF-8, but that is just too much work, and I would be happy to get a quick fix for this. If this matters, the form’s enctype is "multipart/form-data" (includes file uploader, so cannot use any other enctype). I use Apache and PHP.
Thanks!
The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities
Well, nearly, except that it's not at all helpful. Now you can't tell the difference between a real “&#411;” that someone typed expecting it to come out as a string of text with a ‘&’ in it, and a ‘ƛ’ character.
I actually do a htmlspecialchars () on the text before displaying it
Yes. You must do that, or else you've got a security problem.
Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself
Yeah, supposedly you send “accept-charset="UTF-8"” in the form tag. But the reality is that doesn't work in IE. To get a form in UTF-8, you must send a form (page) in UTF-8.
I know that the good idea is to switch the whole software to UTF-8,
Yup. Well, at least the encoding of the page containing the form should be UTF-8.
<form action="action.php" method="get" accept-charset="UTF-8">
<!-- some elements -->
</form>
All browsers should return the values in the encoding specified in accept-charset.
You check to see if the characters are within a certain range. If they fall outside the range of standard UTF-8 characters, do whatever you want to with it. I would do this by looking at each character &, #, 8, 5, 9, 4, and parsing it into something you can apply something to.
Short of finding somewhere where someone has created a Windows-1251 to UTF-8 conversion script, you are probably going to have to roll your own. You are probably going to have to look at each specific character and see what needs to be done with it. If it's something like &copy; you will want to handle it differently than &#8594; because the second one has the # in it.
I think this answers your question.
The html_entity_decode function is probably what you want.
You could set the fourth parameter of the htmlspecialchars function (double_encode, since PHP 5.2.3) to false to avoid the character references being encoded again.
Or you first decode those existing character references.
You can convert the strings to UTF-8 using the PHP multi-byte functions. From there you can do as you wish. Especially the mb_convert_encoding() to move it from windows-1251 to UTF-8, or where ever.
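A rough sketch of that conversion, assuming the mbstring extension is loaded and the posted bytes really are Windows-1251. The sample input is made up. Decoding the numeric references afterwards with html_entity_decode() also decodes references the user typed literally, so this does not resolve the ambiguity described in the question.

```php
<?php
// "Привет" in Windows-1251 bytes, followed by a browser-generated reference for U+2192.
$posted = "\xCF\xF0\xE8\xE2\xE5\xF2 &#8594; ok";

// Step 1: convert the legacy bytes to UTF-8.
$utf8 = mb_convert_encoding($posted, 'UTF-8', 'Windows-1251');

// Step 2: turn character references into real UTF-8 characters.
$text = html_entity_decode($utf8, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Step 3: escape once, at output time.
echo htmlspecialchars($text, ENT_QUOTES, 'UTF-8'), "\n";
```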
I don't quite understand your question though, because if someone enters &amp; as a text string, htmlspecialchars() should convert it to &amp;amp;, which when run back through html_entity_decode() would come out as the text string the user entered.
This is of course if you haven't used the double_encode option when running your string through the htmlspecialchars()
mbstring supports the "charset" HTML-Entities
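$in = 'a &#8594; b & c'; // assumed sample input; the original snippet did not show it
$out = mb_convert_encoding($in, 'UTF-8', 'HTML-ENTITIES'); // deprecated as of PHP 8.2
for ($i = 0; $i < strlen($out); $i++) {
    printf('%02X ', ord($out[$i]));
}
This prints 61 20 E2 86 92 20 62 20 26 20 63, where E2 86 92 is the byte sequence for → (RIGHTWARDS ARROW) in UTF-8.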
You won't be able to distinguish between the browser converting a codepoint to an entity and your users typing in an entity because they look identical. The real solution is to give up on Windows 1251. Instead, serve the webpage and form in UTF-8, ask for UTF-8 encoding and all these problems should just go away.