I have some inherited code whose purpose is to identify urls in a string an prepend the http:// protocol onto them if it doesn't exist.
return preg_replace_callback(
'/((https?:\/\/)?\w+(\.\w{2,})+[\w?&%=+\/]+)/i',
function ($match) {
if (stripos($match[1], 'http://') !== 0 && stripos($match[1], 'https://') !== 0) {
$match[1] = 'http://' . $match[1];
}
return $match[1];
},
$string);
It's working, except when a domain has a hyphen it. So, for-instance, the following string will only partially work.
$string = "In front mfever.com/1 middle http://mf-ever.com/2 at the end";
Can any regex genius see what's wrong with it?
You just need to add the optional dash:
((https?:\/\/)?\w+\-?\w+(\.\w{2,})+[\w?&%=+\/]+)
See it work here https://regex101.com/r/Tkdapj/1
I'm writing a little class to minify JavaScript with PHP. I have the following problematic code in my class:
private function test_opener($str, $i) {
if(ord($str[$i]) === 34 or ord($str[$i]) === 39)
{
if($this->_is_string_opened)
{
if($this->_string_opener === $str[$i] and ! $this->is_escaped($str, $i))
{
$this->_is_string_opened = false;
$this->_string_opener = null;
}
}
else
{
$this->_is_string_opened = true;
$this->_string_opener = $str[$i];
}
}
}
My class loops through each character in the file. The function above detects string opening/closing haracters (' and "). 0x34 and 0x39 are the character codes for " and ', respectively. If one of these characters is detected, a is_string_opened will be flipped to true if this is the first character opening the strong, or false if the character closes the string.
Now, my code breaks when I try to minify the following JavaScript (which is taken from the source of Underscore.js):
var entityMap = {
escape: {
'&': '&',
'<': '<',
'>': '>',
'"': '"',
"'": ''' // Here be dragons
}
};
entityMap.unescape = _.invert(entityMap.escape);
So what's append when the parser reach ''' : The first ' switch the _is_string_open to true. ', which is the HTML hexadecimal entity notation for ', turn it off, and the last ' turn it on again. So the rest of the code is interpreted as text until the next ', which is obviously messing the file parsing process.
I don't understand this PHP behavior. The character code of &;#x27; isn't even 39, it's 38. I ran the code on PHP 5.5.9. The encoding is UTF-8 and the content come directly from POST, but i try to add a htmlentities() to escape this kind of problematic character, nothing changed.
Edit : The data origin (a Controller getting post data)
$js = $_POST['javascript_content'] ?: null;
if($js)
{
$output_js = Jsmin::forge($js)
->min()
->join()
->get();
}
I have been trying to run through a string and find and replace URLs with a link, here what I have come up so far, and it does seem to work for the most part quite well, however there are a few things I'd like to polish. Also it might not be the best performing way of doing that.
I have read many threads on this here on SO, and although it helped a great deal, I still need to tie up the loose ends on it.
I am running through the string two times. The first time I am replacing bbtags with html tags; and the second time I am running through the string and replacing text urls with links:
$body_str = preg_replace('/\[url=(.+?)\](.+?)\[\/url\]/i', '\2', $body_str);
$body_str = preg_replace_callback(
'!(?:^|[^"\'])(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?!',
function ($matches) {
return strpos(trim($matches[0]), 'thisone.com') == FALSE ?
'' . ltrim($matches[0], "\t\n\r\0\x0B.,#?^=%&:/~\+#'") . '' :
'' . ltrim($matches[0], "\t\n\r\0\x0B.,#?^=%&:/~\+#'") . '';
},
$body_str
);
So far the few problems I am finding with this is it tends to pick up the character immediatelly before 'http' etc e.g. a space/comma/colon etc, which broke the links. Thus I used the preg_replace_callback to work around that and trim some unwanted characters that would break the link.
The other problem is that to avoid breaking links by matching urls, which are already in A-tags I am currently excluding urls starting with a quote,double-quote, and I'd rather use href='|href=" for exclusion.
Any tips and advice will be much appreciated
First i allowed myself to refactor a bit your code to make it easier to read and modify :
function urltrim($str) {
return ltrim($str, " \t\n\r\0\x0B.,#?^=%&:/~\+#'");
}
function addlink($str,$nofollow=true) {
return '<a href="' . urltrim($str) . '"'.($nofollow ? ' rel="nofollow" target="_blank"' : '').'>' . urltrim($str) . '</a>';
}
function checksite($str) {
return strpos(trim($str), 'thisone.com') == FALSE ? addlink($str) : addlink($str,false);
}
$body_str = preg_replace('/\[url=(.+?)\](.+?)\[\/url\]/i', '\2', $body_str);
$body_str = preg_replace_callback(
'!(?:^|[^"\'])(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?!',
function ($matches) {
return checksite($matches[0]);
},
$body_str
);
After that i changed the way you handle links :
I considered that the URL is a word (= all characters until you find a space or \n or \t (=\s))
I changed the matching method to match the existence of href= in the front of the string
if it exists then I don't do anything, it's already a link
if no href= is present, then i replace the link
So the urltrim method is not useful anymore since I don't eat up the first char before http
And of course, i use urlencode to encode the url and avoid html injection
function urltrim($str) {
return $str;
}
function addlink($str,$nofollow=true) {
$url = preg_replace("#(https?)%3A%2F%2F#","$1://",urlencode(urltrim($str)));
return '<a href="' . $url . '"'.($nofollow ? ' rel="nofollow" target="_blank"' : '').'>' . urltrim($str) . '</a>';
}
function checksite($str) {
return strpos(trim($str), 'thisone.com') == FALSE ? addlink($str) : addlink($str,false);
}
$body_str = preg_replace('/\[url=(.+?)\](.+?)\[\/url\]/i', '\2', $body_str);
$body_str = preg_replace_callback(
'!(|href=)(["\']?)(https?://[^\s]+)!',
function ($matches) {
if ($matches[1]) {
# If href= is present, dont do anything, return the original string
return $matches[0];
} else {
# add the previous char (" or ') and the link
return $matches[2].checksite($matches[3]);
}
},
$body_str
);
I hope this can help you in your project.
Tell us if it helped.
Bye.
I've a Glype proxy and I want not parse external URLs. All URLs on the page are automatically converted to: http://proxy.com/browse.php?u=[URL HERE]. Example: If I visit The Pirate Bay on my proxy, then I want not to parse the following URLs:
ByteLove.com (Not to: http://proxy.com/browse.php?u=http://bytelove.com&b=0)
BayFiles.com (Not to: http://proxy.com/browse.php?u=http://bayfiles.com&b=0)
BayIMG.com (Not to: http://proxy.com/browse.php?u=http://bayimg.com&b=0)
PasteBay.com (Not to: http://proxy.com/browse.php?u=http://pastebay.com&b=0)
Ipredator.com (Not to: http://proxy.com/browse.php?u=https://ipredator.se&b=0)
etc.
Of course I want to keep the internal URLs, so:
thepiratebay.se/browse (To: http://proxy.com/browse.php?u=http://thepiratebay.se/browse&b=0)
thepiratebay.se/top (To: http://proxy.com/browse.php?u=http://thepiratebay.se/top&b=0)
thepiratebay.se/recent (To: http://proxy.com/browse.php?u=http://thepiratebay.se/recent&b=0)
etc.
Is there a preg_replace to replace all URL's except thepiratebay.se and there subdomains (as in the example)? An other function is also welcome. (Such as domdocument, querypath, substr or strpos. Not str_replace because then I should define all URLs)
I've found something, but I'm not familiar with preg_replace:
$exclude = '.thepiratebay.se';
$pattern = '(https?\:\/\/.*?\..*?)(?=\s|$)';
$message= preg_replace("~(($exclude)?($pattern))~i", '$2$5$6', $message);
I'll guess you would need to provide a whitelist to tell which domains should be proxied:
$whitelist = array();
$whitelist[] = "internal1.se";
$whitelist[] = "internal2.no";
$whitelist[] = "internal3.com";
// and so on...
$string = 'External link 1<br>';
$string .= 'Internal link 1<br>';
$string .= 'Internal link 2<br>';
$string .= 'External link 2<br>';
//Assuming the URL always is inside '' or "" you can use this pattern:
$pattern = '#(https?://proxy\.org/browse\.php\?u=(https?[^&|\"|\']*)(&?[^&|\"|\']*))#i';
$string = preg_replace_callback($pattern, "my_callback", $string);
//I had only PHP 5.2 on my server, so I decided to use a callback function.
function my_callback($match) {
global $whitelist;
// set return bypass proxy URL
$returnstring = urldecode($match[2]);
foreach ($whitelist as $white) {
// check if URL matches whitelist
if (stripos($match[2], $white) > 0) {
$returnstring = $match[0];
break; } }
return $returnstring;
}
echo "NEW STRING[:\n" . $string . "\n]\n";
you can use preg_replace_callback() to execute a callback function for every match. In that function you can determine if the matched string should be converted or not.
<?php
$string = 'http://foobar.com/baz and http://example.org/bumm';
$pattern = '#(https?\:\/\/.*?\..*?)(?=\s|$)#i';
$string = preg_replace_callback($pattern, function($match) {
if (stripos($match[0], 'example.org/') !== false) {
// exclude all URLs containing example.org
return $match[0];
} else {
return 'http://proxy.com/?u=' . urlencode($match[0]);
}
}, $string);
echo $string, "\n";
(Example is using PHP 5.3 closure notation)
I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.
Even though my web application uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode() and overall seems like a bad idea to have around.
W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back.".
How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
How do you present the error in a helpful way to the user?
How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave as-is in the database but doing something (what?) before json_encode()?
I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP?". I'd like advice from people with experience in real-world situations how they've handled this.
As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.
The accept-charset="UTF-8" attribute is only a guideline for browsers to follow, and they are not forced to submit that in that way. Crappy form submission bots are a good example...
I usually ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv, you also have the option to transliterate bad characters.
Here is an example using iconv():
$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
If you want to display an error message to your users I'd probably do this in a global way instead of a per value received basis. Something like this would probably do just fine:
function utf8_clean($str)
{
return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}
$clean_GET = array_map('utf8_clean', $_GET);
if (serialize($_GET) != serialize($clean_GET))
{
$_GET = $clean_GET;
$error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}
// $_GET is clean!
You may also want to normalize new lines and strip (non-)visible control chars, like this:
function Clean($string, $control = true)
{
$string = iconv('UTF-8', 'UTF-8//IGNORE', $string);
if ($control === true)
{
return preg_replace('~\p{C}+~u', '', $string);
}
return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}
Code to convert from UTF-8 to Unicode code points:
function Codepoint($char)
{
$result = null;
$codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
if (is_array($codepoint) && array_key_exists(1, $codepoint))
{
$result = sprintf('U+%04X', $codepoint[1]);
}
return $result;
}
echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072
It is probably faster than any other alternative, but I haven't tested it extensively though.
Example:
$string = 'hello world�';
// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);
function Bad_Codepoint($string)
{
$result = array();
foreach ((array) $string as $char)
{
$codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
if (is_array($codepoint) && array_key_exists(1, $codepoint))
{
$result[] = sprintf('U+%04X', $codepoint[1]);
}
}
return implode('', $result);
}
This may be what you were looking for.
Receiving invalid characters from your web application might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:
<form action="..." accept-charset="UTF-8">
You also might want to take a look at similar questions on Stack Overflow for pointers on how to handle invalid characters, e.g., those in the column to the right, but I think that signaling an error to the user is better than trying to clean up those invalid characters which cause unexpected loss of significant data or unexpected change of your user's inputs.
I put together a fairly simple class to check if input is in UTF-8 and to run through utf8_encode() as needs be:
class utf8
{
/**
* #param array $data
* #param int $options
* #return array
*/
public static function encode(array $data)
{
foreach ($data as $key=>$val) {
if (is_array($val)) {
$data[$key] = self::encode($val, $options);
} else {
if (false === self::check($val)) {
$data[$key] = utf8_encode($val);
}
}
}
return $data;
}
/**
* Regular expression to test a string is UTF8 encoded
*
* RFC3629
*
* #param string $string The string to be tested
* #return bool
*
* #link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
*/
public static function check($string)
{
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs',
$string);
}
}
// For example
$data = utf8::encode($_POST);
For completeness to this question (not necessarily the best answer)...
function as_utf8($s) {
return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
}
There is a multibyte extension for PHP. See Multibyte String
You should try the mb_check_encoding() function.
I recommend merely not allowing garbage to get in. Don't rely on custom functions, which can bog your system down.
Simply walk the submitted data against an alphabet you design. Create an acceptable alphabet string and walk the submitted data, byte by byte, as if it were an array. Push acceptable characters to a new string, and omit unacceptable characters.
The data you store in your database then is data triggered by the user, but not actually user-supplied data.
<?php
// Build alphabet
// Optionally, you can remove characters from this array
$alpha[] = chr(0); // null
$alpha[] = chr(9); // tab
$alpha[] = chr(10); // new line
$alpha[] = chr(11); // tab
$alpha[] = chr(13); // carriage return
for ($i = 32; $i <= 126; $i++) {
$alpha[] = chr($i);
}
/* Remove comment to check ASCII ordinals */
// /*
// foreach ($alpha as $key => $val) {
// print ord($val);
// print '<br/>';
// }
// print '<hr/>';
//*/
//
// // Test case #1
//
// $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv ' . chr(160) . chr(127) . chr(126);
//
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
//
// // Test case #2
//
// $str = '' . '©?™???';
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
//
// $str = '©';
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
$file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
$testfile = implode(chr(10), file($file));
$string = teststr($alpha, $testfile);
print $string;
print '<hr/>';
function teststr(&$alpha, &$str) {
$strlen = strlen($str);
$newstr = chr(0); // null
$x = 0;
if($strlen >= 2) {
for ($i = 0; $i < $strlen; $i++) {
$x++;
if(in_array($str[$i], $alpha)) {
// Passed
$newstr .= $str[$i];
}
else {
// Failed
print 'Found out of scope character. (ASCII: ' . ord($str[$i]). ')';
print '<br/>';
$newstr .= '�';
}
}
}
elseif($strlen <= 0) {
// Failed to qualify for test
print 'Non-existent.';
}
elseif($strlen === 1) {
$x++;
if(in_array($str, $alpha)) {
// Passed
$newstr = $str;
}
else {
// Failed
print 'Total character failed to qualify.';
$newstr = '�';
}
}
else {
print 'Non-existent (scope).';
}
if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
// Skip
}
else {
$newstr = utf8_encode($newstr);
}
// Test encoding:
if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
print 'UTF-8 :D<br/>';
}
else {
print 'ENCODED: ' . mb_detect_encoding($newstr, "UTF-8") . '<br/>';
}
return $newstr . ' (scope: ' . $x . ', ' . $strlen . ')';
}
Strip all characters outside your given subset. At least in some parts of my application I would not allow using characters outside the [a-Z] and [0-9] sets, for example in usernames.
You can build a filter function that silently strips all characters outside this range, or that returns an error if it detects them and pushes the decision to the user.
Try doing what Ruby on Rails does to force all browsers always to post UTF-8 data:
<form accept-charset="UTF-8" action="#{action}" method="post"><div
style="margin:0;padding:0;display:inline">
<input name="utf8" type="hidden" value="✓" />
</div>
<!-- form fields -->
</form>
See railssnowman.info or the initial patch for an explanation.
To have the browser sends form-submission data in the UTF-8 encoding, just render the page with a Content-Type header of "text/html; charset=utf-8" (or use a meta http-equiv tag).
To have the browser sends form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), use accept-charset="UTF-8" in the form.
To have the browser sends form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), and even if the browser is Internet Explorer and the user switched the page encoding to Korean and entered Korean characters in the form fields, add a hidden input to the form with a value such as ✓ which can only be from the Unicode charset (and, in this example, not the Korean charset).
Set UTF-8 as the character set for all headers output by your PHP code.
In every PHP output header, specify UTF-8 as the encoding:
header('Content-Type: text/html; charset=utf-8');