Escaping double quotes in strings with regex - php

This is a follow-up to an earlier post.
Problem: The code below works well, except for strings that contain double quotes, which render as strange characters.
Sample string:
“Walter Isaacson http://t.co/vaLxVduA”
Rendered as:
“Walter Isaacson http://t.co/vaLxVduA���
t.co/vaLxVduA���
I believe the problem is in the regex. What could I try to make this work?
Code:
function makeLink($match) {
// Parse link.
$substr = substr($match, 0, 6);
if ($substr != 'http:/' && $substr != 'https:' && $substr != 'ftp://' && $substr != 'news:/' && $substr != 'file:/') {
$url = 'http://' . $match;
} else {
$url = $match;
}
return '<a href="' . $url . '">' . $match . '</a>';
}
function makeHyperlinks($text) {
// Find links and call the makeLink() function on them.
return preg_replace('/((www\.|http|https|ftp|news|file):\/\/[\w.-]+\.[\w\/:#=.+?,#%&~-]*[^.\'# !(?,><;\)])/e', "makeLink('$1')", $text);
}

The problem is the Unicode character ”. When you add the u modifier, so that the pattern and subject are treated as UTF-8, it works, but the quote is then matched as part of the URL. You need to exclude this quote as well:
preg_replace('/((www\.|http|https|ftp|news|file):\/\/[\w.-]+\.[\w\/:#=.+?,#%&~-]*[^.\'# !(?,>”<;\)])/eu', "makeLink('$1')", $text);
Your regex looks rather large, though. I did a quick search for a URL regex and found this one; it also seems to work and doesn't need all the exclusions:
preg_replace('#(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)#eu', "makeLink('$1')", $text);
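Note that the e modifier used in these snippets was deprecated in PHP 5.5 and removed in PHP 7.0, so on current PHP versions the same idea has to go through preg_replace_callback. A minimal sketch, assuming a simplified variant of the shorter pattern (not the exact regex from the answer) and assuming the link markup is a plain <a href> tag:

```php
<?php
// Sketch only: the pattern is a simplified assumption, not the answer's exact regex.
function makeHyperlinks(string $text): string
{
    // u: treat pattern and subject as UTF-8, so ” is one character, not three bytes.
    $pattern = '#https?://[-\w.]+(?::\d+)?(?:/[\w/.~%-]*(?:\?[^\s"”<>]+)?)?#u';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Escape the URL before placing it in HTML.
            $url = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
            return '<a href="' . $url . '">' . $url . '</a>';
        },
        $text
    );

    // preg_replace_callback returns null on error (e.g. invalid UTF-8 input).
    return $result ?? $text;
}

echo makeHyperlinks('“Walter Isaacson http://t.co/vaLxVduA”'), PHP_EOL;
```

Because \w does not match ” under the u modifier, the closing quote stays outside the link without being listed in an exclusion class for the path.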

Related

php regex preg_replace_callback

I have some inherited code whose purpose is to identify URLs in a string and prepend the http:// protocol to them if it isn't already there.
return preg_replace_callback(
'/((https?:\/\/)?\w+(\.\w{2,})+[\w?&%=+\/]+)/i',
function ($match) {
if (stripos($match[1], 'http://') !== 0 && stripos($match[1], 'https://') !== 0) {
$match[1] = 'http://' . $match[1];
}
return $match[1];
},
$string);
It works, except when a domain has a hyphen in it. For instance, the following string only partially works:
$string = "In front mfever.com/1 middle http://mf-ever.com/2 at the end";
Can any regex genius see what's wrong with it?
You just need to add the optional dash:
((https?:\/\/)?\w+\-?\w+(\.\w{2,})+[\w?&%=+\/]+)
See it work here https://regex101.com/r/Tkdapj/1
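The optional-dash fix covers a single hyphen in the first label. A hedged sketch of a slightly more general variant, assuming it is acceptable to allow hyphens anywhere in that label via [\w-]+ (the function name addScheme is made up for illustration):

```php
<?php
// Hypothetical wrapper around the callback from the question.
function addScheme(string $string): string
{
    $result = preg_replace_callback(
        // [\w-]+ replaces \w+ so labels such as "mf-ever" or "my-cool-site" match as a whole.
        '/((https?:\/\/)?[\w-]+(\.\w{2,})+[\w?&%=+\/]+)/i',
        function (array $match): string {
            if (stripos($match[1], 'http://') !== 0 && stripos($match[1], 'https://') !== 0) {
                return 'http://' . $match[1];
            }
            return $match[1];
        },
        $string
    );

    // Fall back to the input if the regex engine reports an error.
    return $result ?? $string;
}

echo addScheme("In front mfever.com/1 middle http://mf-ever.com/2 at the end"), PHP_EOL;
```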

Convert \n in a clear space

I have a problem with a PHP function. I want to convert every "\n" into an empty space. I've tried this, but it doesn't work:
function clean($text) {
if ($text === null) return null;
if (strstr($text, "\xa7") || strstr($text, "&")) {
$text = preg_replace("/(?i)(\x{00a7}|&)[0-9A-FK-OR]/u", "", $text);
}
$text = htmlspecialchars($text, ENT_QUOTES, "UTF-8");
if (strstr($text, "\n")) {
$text = preg_replace("\n", "", $text);
}
return $text;
}
This is what I want to remove:
The site: click here
If you literally have "\n" in your text, which appears to be the case from your screenshots, then do the following:
$text = str_replace("\\n", '', $text);
Inside a double-quoted PHP string, \n is an escape sequence that produces a newline, so the backslash itself has to be escaped ("\\n") in order to match the literal two-character text \n.
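preg_replace() also works, provided the pattern has delimiters (the "\n" pattern in the question has none):
$text = preg_replace('/\n/',"",$text);
With single quotes, PHP does not interpret \n itself, so the regex engine receives the escape sequence and matches a newline character.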
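The two answers target different inputs: the first removes the literal text \n, the second removes real line breaks. A small sketch that handles both, assuming that is what is wanted (the function name stripNewlines and the optional replacement parameter are made up for illustration):

```php
<?php
// Hypothetical helper: removes the literal text "\n" as well as real line breaks.
function stripNewlines(string $text, string $replacement = ''): string
{
    return str_replace(
        [
            '\n',   // single quotes: a backslash followed by the letter n (two characters)
            "\r\n", // Windows line break; listed before "\r" and "\n" so it is replaced once
            "\r",
            "\n",   // double quotes: a real newline character
        ],
        $replacement,
        $text
    );
}

echo stripNewlines("first line\nsecond line"), PHP_EOL;
echo stripNewlines('first line\nsecond line'), PHP_EOL;
```

Passing ' ' as the second argument would substitute a space instead of removing the break.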

Apostrophe vs its html hexadecimal notation conflict

I'm writing a little class to minify JavaScript with PHP. I have the following problematic code in my class:
private function test_opener($str, $i) {
if(ord($str[$i]) === 34 or ord($str[$i]) === 39)
{
if($this->_is_string_opened)
{
if($this->_string_opener === $str[$i] and ! $this->is_escaped($str, $i))
{
$this->_is_string_opened = false;
$this->_string_opener = null;
}
}
else
{
$this->_is_string_opened = true;
$this->_string_opener = $str[$i];
}
}
}
My class loops through each character in the file. The function above detects string opening and closing characters (' and "). 34 and 39 are the decimal character codes for " and ', respectively. When one of these characters is detected, _is_string_opened is set to true if it opens a string, or to false if it closes the current string.
Now, my code breaks when I try to minify the following JavaScript (which is taken from the source of Underscore.js):
var entityMap = {
escape: {
'&': '&amp;',
'<': '&lt;',
'>': '&gt;',
'"': '&quot;',
"'": '&#x27;' // Here be dragons
}
};
entityMap.unescape = _.invert(entityMap.escape);
Here is what happens when the parser reaches '&#x27;': the first ' sets _is_string_opened to true. &#x27;, which is the HTML hexadecimal entity for ', turns it off, and the last ' turns it on again. The rest of the code is then treated as string content until the next ', which obviously breaks the parsing.
I don't understand this PHP behavior. The character code of the first character of &#x27; isn't even 39, it's 38. I ran the code on PHP 5.5.9. The encoding is UTF-8 and the content comes directly from POST. I tried adding htmlentities() to escape this kind of problematic character, but nothing changed.
Edit: the data source (a controller receiving POST data):
$js = $_POST['javascript_content'] ?: null;
if($js)
{
$output_js = Jsmin::forge($js)
->min()
->join()
->get();
}
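One way to check the toggling logic in isolation is the standalone sketch below. The helper names isEscaped and endsInsideString are made up for illustration, and isEscaped is only a guess at what the class's is_escaped() does (it counts preceding backslashes). Run on the Underscore line, the state ends up closed when the entity is present as the literal text &#x27; and stuck open only when the input contains three bare apostrophes, which would suggest (as an assumption, not a confirmed diagnosis) that the entity was already decoded before the minifier saw it.

```php
<?php
// Hypothetical stand-in for the class's is_escaped() method.
function isEscaped(string $str, int $i): bool
{
    // A quote is escaped when it is preceded by an odd number of backslashes.
    $backslashes = 0;
    for ($j = $i - 1; $j >= 0 && $str[$j] === '\\'; $j--) {
        $backslashes++;
    }
    return $backslashes % 2 === 1;
}

// Returns true when the scan finishes while still inside a string literal.
function endsInsideString(string $js): bool
{
    $opener = null; // null means "not inside a string"
    for ($i = 0, $n = strlen($js); $i < $n; $i++) {
        $c = $js[$i];
        if ($c !== '"' && $c !== "'") {
            continue;
        }
        if ($opener === null) {
            $opener = $c;                 // this quote opens a string
        } elseif ($opener === $c && !isEscaped($js, $i)) {
            $opener = null;               // matching, unescaped quote closes it
        }
    }
    return $opener !== null;
}

var_dump(endsInsideString('"\'": \'&#x27;\'')); // entity kept as literal text
var_dump(endsInsideString('"\'": \'\'\''));      // entity already decoded to '
```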

How to filter URLs that contain white space with preg match?

I parse through a text that contains several links. Some of them contain white spaces but have a file ending. My current pattern is:
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $links, $match);
This works the same way:
preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $links, $match);
I don't know much about the patterns and didn't find a good tutorial that explains the meaning of all possible patterns and shows examples.
How could I filter a URL like this:
http://my-url.com/my doc.doc or even http://my-url.com/my doc with more white spaces.doc
The \s in those preg_match_all patterns stands for a whitespace character. But how could I check whether there is a file extension after one or more spaces?
Is it possible?
Why not just use PHP's filter functions?
<?php
$url = "http://my-url.com/my doc.doc";
if(!filter_var($url, FILTER_VALIDATE_URL))
{
echo "URL is not valid";
}
else
{
echo "URL is valid";
}
OUTPUT :
URL is not valid
This might be what you are looking for; it uses urlencode:
$file = "my doc with more white spaces.doc";
echo " http://my-url.com/" . urlencode($file);
which produces:
http://my-url.com/my+doc+with+more+white+spaces.doc
or, with rawurlencode instead, produces:
http://my-url.com/my%20doc%20with%20more%20white%20spaces.doc
EDIT: Something like the following, which uses parse_url, might help to parse your URLs:
$url = 'http://my-url.com/my doc with more white spaces.doc';
$purl = parse_url($url);
$rurl = "";
if(isset($purl['scheme'])){
$rurl .= $purl['scheme'] . "://";
}
if(isset($purl['host'], $purl['path'])){
$rurl .= $purl['host'] . rawurlencode($purl['path']);
}
if($rurl === ""){
$rurl = $url;#error parsing error/invalid url?
}
For subdirectories you can do:
$purl['path'] = implode('/', array_map(function($value){return rawurlencode($value);}, explode('/', $purl['path'])));
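Put together, the parse_url approach with per-segment encoding might look like the sketch below. The function name encodeUrlPath is made up for illustration, and the sketch assumes an absolute URL whose path has not been percent-encoded yet (an already encoded %20 would be encoded a second time).

```php
<?php
// Hypothetical helper combining parse_url with per-segment rawurlencode.
// Limitation of this sketch: port, query string and fragment are dropped.
function encodeUrlPath(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // not parseable as an absolute URL; leave it untouched
    }

    $path = $parts['path'] ?? '';
    // Encode each segment separately so the slashes between them survive.
    $encoded = implode('/', array_map('rawurlencode', explode('/', $path)));

    return $parts['scheme'] . '://' . $parts['host'] . $encoded;
}

echo encodeUrlPath('http://my-url.com/some dir/my doc with more white spaces.doc'), PHP_EOL;
```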
I don't know much about PHP, but this regex
(http|ftp)(s)?://([\w-]+\.)+[\w-]+(/[\w ./?%&=-]*)?
will match URLs even when they contain spaces.
I think this regex will do.
Use this regex:
preg_match_all("/^(?si)(?>\s*)(((?>https?:\/\/(?>www\.)?)?(?=[\.-a-z0-9]{2,253}(?>$|\/|\?|\s))[a-z0-9][a-z0-9-]{1,62}(?>\.[a-z0-9][a-z0-9-]{1,62})+)(?>(?>\/|\?).*)?)?(?>\s*)$/", $input_lines, $output_array);
Alright, after working through this really helpful tutorial I finally know how regex syntax works. After finishing it I experimented a bit on this site.
It was pretty easy after figuring out that all hyperlinks in my parsed document were in between quotation marks so I just had to change the regex to:
preg_match_all('#\bhttps?://[^()<>"]+#', $links, $match);
so that after the " it is looking for the next match that begins with http.
But that's not the full solution yet. The user Class was right: without applying rawurlencode to the filenames it won't work.
So the next step was this:
function endsWith($haystack, $needle)
{
return $needle === "" || substr($haystack, -strlen($needle)) === $needle;
}
if(endsWith($textlink, ".doc") || endsWith($textlink, ".docx") || endsWith($textlink, ".pdf") || endsWith($textlink, ".jpg") || endsWith($textlink, ".jpeg") || endsWith($textlink, ".png")){
$file = substr( $textlink, strrpos( $textlink, '/' )+1 );
$rest_url=substr($textlink, 0, strrpos($textlink, '/' )+1 );
$textlink=$rest_url.rawurlencode($file);
}
That extracts the filenames from the URLs and applies rawurlencode to them so that the output links are correct.
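The two steps above (extract the quoted links, then encode only the filename) can be sketched end to end as follows. The helper name encodeFileName and the sample markup in $html are assumptions for illustration; unlike endsWith(), the extension check here is case-insensitive.

```php
<?php
// Hypothetical helper; the extension list mirrors the one used in the answer.
function encodeFileName(string $link): string
{
    // Only touch links that end in one of the known file extensions.
    if (!preg_match('/\.(?:docx?|pdf|jpe?g|png)$/i', $link)) {
        return $link;
    }
    $slash = strrpos($link, '/');
    if ($slash === false) {
        return rawurlencode($link);
    }
    // Keep everything up to the last slash, encode only the filename.
    return substr($link, 0, $slash + 1) . rawurlencode(substr($link, $slash + 1));
}

// Made-up sample input: every link sits between quotation marks.
$html = '<a href="http://my-url.com/my doc.doc">x</a> <a href="http://my-url.com/page">y</a>';

preg_match_all('#\bhttps?://[^()<>"]+#', $html, $match);
$links = array_map('encodeFileName', $match[0]);

print_r($links);
```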
I think this should work:
$url = '...';
$url_new = '';
$array = explode(' ',$url);
foreach($array as $name => $val){
if ($val!=' '){
$url_new = $url_new.$val;
}
}

Codeigniter and preg_replace

I use Codeigniter to create a multilingual website and everything works fine, but when I try to use the "alternative languages helper" by Luis I've got a problem. This helper uses a regular expression to replace the current language with the new one:
$new_uri = preg_replace('/^'.$actual_lang.'/', $lang, $uri);
The problem is that I have a URL like this: http://www.example.com/en/language/english/ and I want to replace only the first "en" without changing the word "english". I tried to use the limit for preg_replace:
$new_uri = preg_replace('/^'.$actual_lang.'/', $lang, $uri, 1);
but this doesn't work for me. Any ideas?
You could do something like this:
$regex = '#^'.preg_quote($actual_lang, '#').'(?=/|$)#';
$new_uri = preg_replace($regex, $lang, $uri);
The trailing lookahead (?=/|$) basically means "only match if the next character is a forward slash or the end of the string".
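A quick way to see the effect of the lookahead is to wrap it in a function and try it on a few URIs. This is a minimal sketch; the function name switchLanguage and the sample URIs are assumptions, and $uri is assumed to be the path without scheme or domain.

```php
<?php
// Hypothetical wrapper around the lookahead-based replacement.
function switchLanguage(string $uri, string $actualLang, string $lang): string
{
    // Match the language code only at the start, and only when it is a whole
    // segment, i.e. followed by "/" or by the end of the string.
    $regex = '#^' . preg_quote($actualLang, '#') . '(?=/|$)#';

    return preg_replace($regex, $lang, $uri) ?? $uri;
}

echo switchLanguage('en/language/english/', 'en', 'fr'), PHP_EOL;
echo switchLanguage('english/page', 'en', 'fr'), PHP_EOL;
```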
Edit:
If the code you always want to replace is at the beginning of the path, you could always do:
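if (stripos($url, $actual_lang) !== false) {
if (strpos($url, '://') !== false) {
$path = (string) parse_url($url, PHP_URL_PATH);
} else {
$path = $url;
}
// Remember a leading slash so that it survives the rebuild.
$prefix = (strpos($path, '/') === 0) ? '/' : '';
// Split off the first path segment; pad so both variables are always set.
list($language, $rest) = array_pad(explode('/', ltrim($path, '/'), 2), 2, '');
if ($language == $actual_lang) {
$url = str_replace($path, $prefix . $lang . '/' . $rest, $url);
}
}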
It's a bit more code, but it should be fairly robust. You could always build a class to do this for you (by parsing, replacing and then rebuilding the URL)...
If you know what the beginning of the URL will always be, use it in the regex!
$domain = "http://www.example.com/";
$regex = '#(?<=^' . preg_quote($domain, '#') . ')' . preg_quote($actual_lang, '#') . '\b#';
$new_uri = preg_replace($regex, $lang, $uri);
In the case of your example, the regular expression would become #(?<=^http\://www\.example\.com/)en\b#, which matches en only if it follows the specified beginning of the URL ((?<=...) in a regular expression specifies a positive lookbehind) and is followed by a word boundary (so english wouldn't match).
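The lookbehind variant can be exercised the same way. A minimal sketch, with the function name switchLanguageAfterDomain made up for illustration and the domain passed in as a parameter:

```php
<?php
// Hypothetical wrapper around the lookbehind-based replacement.
function switchLanguageAfterDomain(string $uri, string $domain, string $actualLang, string $lang): string
{
    // (?<=^domain) requires the fixed prefix right before the language code;
    // \b stops "en" from matching the start of "english".
    $regex = '#(?<=^' . preg_quote($domain, '#') . ')' . preg_quote($actualLang, '#') . '\b#';

    return preg_replace($regex, $lang, $uri) ?? $uri;
}

$domain = 'http://www.example.com/';
echo switchLanguageAfterDomain('http://www.example.com/en/language/english/', $domain, 'en', 'fr'), PHP_EOL;
```

Since \b also counts a hyphen as a boundary, a segment such as en-US would still match; the lookahead (?=/|$) is stricter if that matters.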
