I want to ask 2 questions about url conversion in php.
1 question: I need to convert text into link. I've done my own preg and also read many forums, but all solutions are connected with www. or (ht|f)tp(s), but I need preg that will convert domain names even without www and http in text, for example:
I like stackoverflow.com very much
into
I like <a href='http://stackoverflow.com'>stackoverflow.com</a> very much
Sure it must consider points and commas and etc., like:
I like stackoverflow.com.
into
I like <a href='http://stackoverflow.com'>stackoverflow.com</a>.
And one more question: links with url-encoded symbols on wiki are displayed as they are, but on other sites they are displayed like url-encoded string (%XX%XX%XX). How did wiki do this? Thanks!
For your first question, I would not recommend you to that, it is very difficult to know if the a word containing a dot is a domain name or not, and people often forget to put a space after the dot in the middle of a paragraph.
for your second question, it is simple, you url encode the link in the href but not between the open and close a tag. For example :
http://site.na/test.php?t=sqdl&54"dfd°=+
The function auto_link from CodeIgniter URL helper can help you:
if ( ! function_exists('auto_link'))
{
function auto_link($str, $type = 'both', $popup = FALSE)
{
if ($type != 'email')
{
if (preg_match_all("#(^|\s|\()((http(s?)://)|(www\.))(\w+[^\s\)\<]+)#i", $str, $matches))
{
$pop = ($popup == TRUE) ? " target=\"_blank\" " : "";
for ($i = 0; $i < count($matches['0']); $i++)
{
$period = '';
if (preg_match("|\.$|", $matches['6'][$i]))
{
$period = '.';
$matches['6'][$i] = substr($matches['6'][$i], 0, -1);
}
$str = str_replace($matches['0'][$i],
$matches['1'][$i].'<a href="http'.
$matches['4'][$i].'://'.
$matches['5'][$i].
$matches['6'][$i].'"'.$pop.'>http'.
$matches['4'][$i].'://'.
$matches['5'][$i].
$matches['6'][$i].'</a>'.
$period, $str);
}
}
}
if ($type != 'url')
{
if (preg_match_all("/([a-zA-Z0-9_\.\-\+]+)#([a-zA-Z0-9\-]+)\.([a-zA-Z0-9\-\.]*)/i", $str, $matches))
{
for ($i = 0; $i < count($matches['0']); $i++)
{
$period = '';
if (preg_match("|\.$|", $matches['3'][$i]))
{
$period = '.';
$matches['3'][$i] = substr($matches['3'][$i], 0, -1);
}
$str = str_replace($matches['0'][$i], safe_mailto($matches['1'][$i].'#'.$matches['2'][$i].'.'.$matches['3'][$i]).$period, $str);
}
}
}
return $str;
}
}
Related
Say I have string such as below:
"b<a=2<sup>2</sup>"
Actually its a formula. I need to display this formula on webpage but after b string is hiding because its considered as broken anchor tag. I tried with htmlspecialchars method but it returns complete string as plain text. I am trying with some regex but I can get only text between some tags.
UPDATE:
This seems to work with this formula:
"(c<a) = (b<a) = 2<sup>2</sup>"
And even with this formula:
"b<a=2<sup>2</sup>"
HERE'S THE MAGIC:
<?php
$_string = "b<a=2<sup>2</sup>";
$string = "(c<a) = (b<a) = 2<sup>2</sup>";
$open_sup = strpos($string,"<sup>");
$close_sup = strpos($string,"</sup>");
$chars_array = str_split($string);
foreach($chars_array as $index => $char)
{
if($index != $open_sup && $index != $close_sup)
{
if($char == "<")
{
echo "<";
}
else{
echo $char;
}
}
else{
echo $char;
}
}
OLD SOLUTION (DOESN'T WORK)
Maybe this can help:
I've tried to backslash chars, but it doesn't work as expected.
Then i've tried this one:
<?php
$string = "b<a=2<sup>2</sup>";
echo $string;
?>
Using < html entity it seems to work if i understood your problem...
Let me know
Probably you can give spaces such as :
b < a = 2<sup>2</sup>
It does not disappear the tag and looks much more understanding....
You could try this regex approach, which should skip elements.
$regex = '/<(.*?)\h*.*>.+<\/\1>(*SKIP)(*FAIL)|(<|>)/';
$string = 'b<a=2<sup>2</sup>';
$string = preg_replace_callback($regex, function($match) {
return htmlentities($match[2]);
}, $string);
echo $string;
Output:
b<a=2<sup>2</sup>
PHP Demo: https://eval.in/507605
Regex101: https://regex101.com/r/kD0iM0/1
I'm using PHP's strip_tags() function to strip tags from a string. For example:
$text = strip_tags( $text );
My aim is to strip all tags unless the tags happen to be contained inside backticks. If tags are contained inside backticks, I don't want to strip them.
My first thought was to try using the second parameter of strip_tags(). This will let me specify allowable tags which are not to be removed. For example, strip_tags( $text, '<strong>'). However, this doesn't quite do what I'm looking for.
How can I strip all HTML tags from a string except tags that happen to be contained inside backticks?
Ref: http://php.net/manual/en/function.strip-tags.php
To back my comment with an answer, something like:
function strip($input)
{
preg_match_all('/`([^`]+)`/', $input, $retain);
for($i = 0; $i < count($retain[0]); $i++)
{
// Replace HTML wrapped in backticks with match index.
$input = str_replace($retain[0][$i], "{{$i}}", $input);
}
// Strip tags.
$input = strip_tags($input);
for($i = 0; $i < count($retain[0]); $i++)
{
// Replace previous replacements with relevant data.
$replace = $retain[1][$i];
// Do some stuff with $replace here - maybe check that it's a tag
// you're comfortable with else use htmlspecialchars(), etc.
// ...
$input = str_replace("{{$i}}", $replace, $input);
}
return $input;
}
With a test:
echo strip("Hello <strong>there</strong>, what's `<em>`up`</em>`?");
// Output: Hello there, what's <em>up</em>?
If your escape sequence is fixed as a `, you can do something much simpler than obfuscation (which Marty suggests in his comment, and which is one of my favorite techniques if I'm being perfectly honest). Even if you were to use obfuscation or a preg_replace, you would still need to account for escaped ticks.
Instead, you can do something like:
$strippeddown = array();
$breakdown = explode('`', $text);
$j = 1;
foreach ($breakdown AS $i => $gather)
{
if ($j > 1)
{
$j--;
unset($breakdown["$i"]);
continue;
}
$j = 1;
while (strrpos($gather, '\\') === 0 AND isset($breakdown[$i + $j]))
{
$gather = $breakdown[$i + $j];
$breakdown["$i"] .= '`' . $gather;
$j++;
}
}
$breakdown = array_values($breakdown);
foreach ($breakdown AS $i => $gather)
{
if (!$i OR !($i % 2))
{
$strippeddown[] = strip_tags($gather);
}
else
{
$strippeddown[] = $gather;
}
}
$text = implode('`', $strippeddown);
I need to convert lib_someString to someString inside a block of text using str_replace [not regex].
Here's an example to give an exact sense what I mean: lib_12345 => 12345. I need to do this for a bunch of instances in a block of text.
Below is my attempt. Problem I'm getting is that my function is not doing anything (I just get lib_id returned).
function extractLibId($val){ // function to get the "12345" in the above example
$lclRetVal = substr($val, 5, strlen($val));
return $lclRetVal;
}
function Lib($text){ // does the replace for all lib_ instances in the text
$lclVar = "lib_";
$text = str_replace($lclVar, "<a href='".extractLibId($lclVar)."'>".extractLibId($lclVar)."</a>", $text);
return $text;
}
Regexp gonna be faster and more clear, you will have no need to call your function for every possible 'lib_' string:
function Lib($text) {
$count = null;
return preg_replace('/lib_([0-9]+)/', '$1', $text, -1, $count);
}
$text = 'some text lib_123123 goes here lib_111';
$text = Lib($text);
Without regexp, but every time Lib2 will be called somewhere will die cute kitten:
function extractLibId($val) {
$lclRetVal = substr($val, 4);
return $lclRetVal;
}
function Lib2($text) {
$count = null;
while (($pos = strpos($text, 'lib_')) !== false) {
$end = $pos;
while (!in_array($text[$end], array(' ', ',', '.')) && $end < strlen($text))
$end++;
$sub = substr($text, $pos, $end - $pos);
$text = str_replace($sub, ''.extractLibId($sub).'', $text);
}
return $text;
}
$text = 'some text lib_123123 goes here lib_111';
$text = Lib2($text);
Use preg_replace.
Although it is possible to do what you need without regular expressions, you say you don't want to use them because of performance reasons. I doubt the other solution will be faster, so here is a simple regex to benchmark against:
echo preg_replace("/lib_(\w+)/", '$1', $str);
As shown here: http://codepad.org/xGj78r9r
Ignoring how ridiculous area of optimizing this is, even the simplest implementation with minimal validation already takes only 33% less time than a regex
<?php
function uselessFunction( $val ) {
if( strpos( $val, "lib_" ) !== 0 ) {
return $val;
}
$str = substr( $val, 4 );
return "{$str}";
}
$l = 100000;
$now = microtime(TRUE);
while( $l-- ) {
preg_replace( '/^lib_(.*)$/', "$1", 'lib_someString' );
}
echo (microtime(TRUE)-$now)."\n";
//0.191093
$l = 100000;
$now = microtime(TRUE);
while( $l-- ) {
uselessFunction( "lib_someString" );
}
echo (microtime(TRUE)-$now);
//0.127598
?>
If you're restricted from using a regex, you're going to have difficult time searching for a string you describe as "someString", i.e. not precisely known in advance. If you know the string is exactly lib_12345, for example, then set $lclVar to that string. On the other hand, if you don't know the exact string in advance, you'll have to use a regex via preg_replace() or a similar function.
Im working on a commenting web application and i want to parse user mentions (#user) as links. Here is what I have so far:
$text = "#user is not #user1 but #user3 is #user4";
$pattern = "/\#(\w+)/";
preg_match_all($pattern,$text,$matches);
if($matches){
$sql = "SELECT *
FROM users
WHERE username IN ('" .implode("','",$matches[1]). "')
ORDER BY LENGTH(username) DESC";
$users = $this->getQuery($sql);
foreach($users as $i=>$u){
$text = str_replace("#{$u['username']}",
"<a href='#' class='ct-userLink' rel='{$u['user_id']}'>#{$u['username']}</a> ", $text);
}
$echo $text;
}
The problem is that user links are being overlapped:
<a rel="11327" class="ct-userLink" href="#">
<a rel="21327" class="ct-userLink" href="#">#user</a>1
</a>
How can I avoid links overlapping?
Answer Update
Thanks to the answer picked, this is how my new foreach loop looks like:
foreach($users as $i=>$u){
$text = preg_replace("/#".$u['username']."\b/",
"<a href='#' title='{$u['user_id']}'>#{$u['username']}</a> ", $text);
}
Problem seems to be that some usernames can encompass other usernames. So you replace user1 properly with <a>user1</a>. Then, user matches and replaces with <a><a>user</a>1</a>. My suggestion is to change your string replace to a regex with a word boundary, \b, that is required after the username.
The Twitter widget has JavaScript code to do this. I ported it to PHP in my WordPress plugin. Here's the relevant part:
function format_tweet($tweet) {
// add #reply links
$tweet_text = preg_replace("/\B[#@]([a-zA-Z0-9_]{1,20})/",
"#<a class='atreply' href='http://twitter.com/$1'>$1</a>",
$tweet);
// make other links clickable
$matches = array();
$link_info = preg_match_all("/\b(((https*\:\/\/)|www\.)[^\"\']+?)(([!?,.\)]+)?(\s|$))/",
$tweet_text, $matches, PREG_SET_ORDER);
if ($link_info) {
foreach ($matches as $match) {
$http = preg_match("/w/", $match[2]) ? 'http://' : '';
$tweet_text = str_replace($match[0],
"<a href='" . $http . $match[1] . "'>" . $match[1] . "</a>" . $match[4],
$tweet_text);
}
}
return $tweet_text;
}
instead of parsing for '#user' parse for '#user ' (with space in the end) or ' #user ' to even avoid wrong parsing of email addresses (eg: mailaddress#user.com) maybe ' #user: ' should also be allowed. this will only work, if usernames have no whitespaces...
You can go for a custom str replace function which stops at first replace.. Something like ...
function str_replace_once($needle , $replace , $haystack){
$pos = strpos($haystack, $needle);
if ($pos === false) {
// Nothing found
return $haystack;
}
return substr_replace($haystack, $replace, $pos, strlen($needle));
}
And use it like:
foreach($users as $i=>$u){
$text = str_replace_once("#{$u['username']}",
"<a href='#' class='ct-userLink' rel='{$u['user_id']}'>#{$u['username']}</a> ", $text);
}
You shouldn’t replace one certain user mention at a time but all at once. You could use preg_split to do that:
// split text at mention while retaining user name
$parts = preg_split("/#(\w+)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE);
$n = count($parts);
// $n is always an odd number; 1 means no match found
if ($n > 1) {
// collect user names
$users = array();
for ($i=1; $i<$n; $i+=2) {
$users[$parts[$i]] = '';
}
// get corresponding user information
$sql = "SELECT *
FROM users
WHERE username IN ('" .implode("','", array_keys($users)). "')";
$users = array();
foreach ($this->getQuery($sql) as $user) {
$users[$user['username']] = $user;
}
// replace mentions
for ($i=1; $i<$n; $i+=2) {
$u = $users[$parts[$i]];
$parts[$i] = "<a href='#' class='ct-userLink' rel='{$u['user_id']}'>#{$u['username']}</a>";
}
// put everything back together
$text = implode('', $parts);
}
I like dnl solution of parsing ' #user', but maybe is not suitable for you.
Anyway, did you try to use strip_tags function to remove the anchor tags? That way you have the string without the links, and you can parse it building the links again.
strip_tags
If I have a string that contains a url (for examples sake, we'll call it $url) such as;
$url = "Here is a funny site http://www.tunyurl.com/34934";
How do i remove the URL from the string?
Difficulty is, urls might also show up without the http://, such as ;
$url = "Here is another funny site www.tinyurl.com/55555";
There is no HTML present. How would i start a search if http or www exists, then remove the text/numbers/symbols until the first space?
I re-read the question, here is a function that would work as intended:
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,'http') || (count(explode('.',$u)) > 1)) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
return implode(' ',$U);
}
$url = "Here is another funny site www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);
Edit #2/#3 (I must be bored). Here is a version that verifies there is a TLD within the URL:
function containsTLD($string) {
preg_match(
"/(AC($|\/)|\.AD($|\/)|\.AE($|\/)|\.AERO($|\/)|\.AF($|\/)|\.AG($|\/)|\.AI($|\/)|\.AL($|\/)|\.AM($|\/)|\.AN($|\/)|\.AO($|\/)|\.AQ($|\/)|\.AR($|\/)|\.ARPA($|\/)|\.AS($|\/)|\.ASIA($|\/)|\.AT($|\/)|\.AU($|\/)|\.AW($|\/)|\.AX($|\/)|\.AZ($|\/)|\.BA($|\/)|\.BB($|\/)|\.BD($|\/)|\.BE($|\/)|\.BF($|\/)|\.BG($|\/)|\.BH($|\/)|\.BI($|\/)|\.BIZ($|\/)|\.BJ($|\/)|\.BM($|\/)|\.BN($|\/)|\.BO($|\/)|\.BR($|\/)|\.BS($|\/)|\.BT($|\/)|\.BV($|\/)|\.BW($|\/)|\.BY($|\/)|\.BZ($|\/)|\.CA($|\/)|\.CAT($|\/)|\.CC($|\/)|\.CD($|\/)|\.CF($|\/)|\.CG($|\/)|\.CH($|\/)|\.CI($|\/)|\.CK($|\/)|\.CL($|\/)|\.CM($|\/)|\.CN($|\/)|\.CO($|\/)|\.COM($|\/)|\.COOP($|\/)|\.CR($|\/)|\.CU($|\/)|\.CV($|\/)|\.CX($|\/)|\.CY($|\/)|\.CZ($|\/)|\.DE($|\/)|\.DJ($|\/)|\.DK($|\/)|\.DM($|\/)|\.DO($|\/)|\.DZ($|\/)|\.EC($|\/)|\.EDU($|\/)|\.EE($|\/)|\.EG($|\/)|\.ER($|\/)|\.ES($|\/)|\.ET($|\/)|\.EU($|\/)|\.FI($|\/)|\.FJ($|\/)|\.FK($|\/)|\.FM($|\/)|\.FO($|\/)|\.FR($|\/)|\.GA($|\/)|\.GB($|\/)|\.GD($|\/)|\.GE($|\/)|\.GF($|\/)|\.GG($|\/)|\.GH($|\/)|\.GI($|\/)|\.GL($|\/)|\.GM($|\/)|\.GN($|\/)|\.GOV($|\/)|\.GP($|\/)|\.GQ($|\/)|\.GR($|\/)|\.GS($|\/)|\.GT($|\/)|\.GU($|\/)|\.GW($|\/)|\.GY($|\/)|\.HK($|\/)|\.HM($|\/)|\.HN($|\/)|\.HR($|\/)|\.HT($|\/)|\.HU($|\/)|\.ID($|\/)|\.IE($|\/)|\.IL($|\/)|\.IM($|\/)|\.IN($|\/)|\.INFO($|\/)|\.INT($|\/)|\.IO($|\/)|\.IQ($|\/)|\.IR($|\/)|\.IS($|\/)|\.IT($|\/)|\.JE($|\/)|\.JM($|\/)|\.JO($|\/)|\.JOBS($|\/)|\.JP($|\/)|\.KE($|\/)|\.KG($|\/)|\.KH($|\/)|\.KI($|\/)|\.KM($|\/)|\.KN($|\/)|\.KP($|\/)|\.KR($|\/)|\.KW($|\/)|\.KY($|\/)|\.KZ($|\/)|\.LA($|\/)|\.LB($|\/)|\.LC($|\/)|\.LI($|\/)|\.LK($|\/)|\.LR($|\/)|\.LS($|\/)|\.LT($|\/)|\.LU($|\/)|\.LV($|\/)|\.LY($|\/)|\.MA($|\/)|\.MC($|\/)|\.MD($|\/)|\.ME($|\/)|\.MG($|\/)|\.MH($|\/)|\.MIL($|\/)|\.MK($|\/)|\.ML($|\/)|\.MM($|\/)|\.MN($|\/)|\.MO($|\/)|\.MOBI($|\/)|\.MP($|\/)|\.MQ($|\/)|\.MR($|\/)|\.MS($|\/)|\.MT($|\/)|\.MU($|\/)|\.MUSEUM($|\/)|\.MV($|\/)|\.MW($|\/)|\.MX($|\/)|\.MY($|\/)|\.MZ($|\/)|\.NA($|\/)|\.NAME($|\/)|\.NC($|\/)|\.NE($|\/)|\.NET($|\/)|\.NF($|\/)|\.NG($|\/)|\.NI($|\/)|\.NL($|\/)|\.NO($|\/)|\.NP($|\/)|\.NR($|\/)|\.NU($|\/)|\.NZ($|\/)|\.OM($|\/)|\.ORG($|\/)|\.PA($|\/)|\.PE($|\/)|\.PF($|\/)|\.PG($|\/)|\.PH($|\/)|\.PK($|\/)|\.PL($|\/)|\.PM($|\/)|\.PN($|\/)|\.PR($|\/)|\.PRO($|\/)|\.PS($|\/)|\.PT($|\/)|\.PW($|\/)|\.PY($|\/)|\.QA($|\/)|\.RE($|\/)|\.RO($|\/)|\.RS($|\/)|\.RU($|\/)|\.RW($|\/)|\.SA($|\/)|\.SB($|\/)|\.SC($|\/)|\.SD($|\/)|\.SE($|\/)|\.SG($|\/)|\.SH($|\/)|\.SI($|\/)|\.SJ($|\/)|\.SK($|\/)|\.SL($|\/)|\.SM($|\/)|\.SN($|\/)|\.SO($|\/)|\.SR($|\/)|\.ST($|\/)|\.SU($|\/)|\.SV($|\/)|\.SY($|\/)|\.SZ($|\/)|\.TC($|\/)|\.TD($|\/)|\.TEL($|\/)|\.TF($|\/)|\.TG($|\/)|\.TH($|\/)|\.TJ($|\/)|\.TK($|\/)|\.TL($|\/)|\.TM($|\/)|\.TN($|\/)|\.TO($|\/)|\.TP($|\/)|\.TR($|\/)|\.TRAVEL($|\/)|\.TT($|\/)|\.TV($|\/)|\.TW($|\/)|\.TZ($|\/)|\.UA($|\/)|\.UG($|\/)|\.UK($|\/)|\.US($|\/)|\.UY($|\/)|\.UZ($|\/)|\.VA($|\/)|\.VC($|\/)|\.VE($|\/)|\.VG($|\/)|\.VI($|\/)|\.VN($|\/)|\.VU($|\/)|\.WF($|\/)|\.WS($|\/)|\.XN--0ZWM56D($|\/)|\.XN--11B5BS3A9AJ6G($|\/)|\.XN--80AKHBYKNJ4F($|\/)|\.XN--9T4B11YI5A($|\/)|\.XN--DEBA0AD($|\/)|\.XN--G6W251D($|\/)|\.XN--HGBK6AJ7F53BBA($|\/)|\.XN--HLCJ6AYA9ESC7A($|\/)|\.XN--JXALPDLP($|\/)|\.XN--KGBECHTV($|\/)|\.XN--ZCKZAH($|\/)|\.YE($|\/)|\.YT($|\/)|\.YU($|\/)|\.ZA($|\/)|\.ZM($|\/)|\.ZW)/i",
$string,
$M);
$has_tld = (count($M) > 0) ? true : false;
return $has_tld;
}
function cleaner($url) {
$U = explode(' ',$url);
$W =array();
foreach ($U as $k => $u) {
if (stristr($u,".")) { //only preg_match if there is a dot
if (containsTLD($u) === true) {
unset($U[$k]);
return cleaner( implode(' ',$U));
}
}
}
return implode(' ',$U);
}
$url = "Here is another funny site badurl.badone somesite.ca/worse.jpg but this badsite.com www.tinyurl.com/55555 and http://www.tinyurl.com/55555 and img.hostingsite.com/badpic.jpg";
echo "Cleaned: " . cleaner($url);
returns:
Cleaned: Here is another funny site badurl.badone but this and and
$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|$!:,.;]*[A-Z0-9+&##\/%=~_|$]/i', '', $string);
Parsing text for URLs is hard and looking for pre-existing, heavily tested code that already does this for you would be better than writing your own code and missing edge cases. For example, I would take a look at the process in Django's urlize, which wraps URLs in anchors. You could port it over to PHP, and--instead of wrapping URLs in an anchor--just delete them from the text.
thanks mike,
update a bit, it return notice error,
'/\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|$!:,.;]*[A-Z0-9+&##\/%=~_|$]/i'
$string = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&##\/%?=~_|$!:,.;]*[A-Z0-9+&##\/%=~_|$]/i', '', $string);
$url = "Here is a funny site http://www.tunyurl.com/34934";
$replace = 'http www .com .org .net';
$with = '';
$clean_url = clean($url,$replace,$with);
echo $clean_url;
function clean($url,$replace,$with) {
$replace = explode(" ",$replace);
$new_string = '';
$check = explode(" ",$url);
foreach($check AS $key => $value) {
foreach($replace AS $key2 => $value2 ) {
if (-1 < strpos( strtolower($value), strtolower($value2) ) ) {
$value = $with;
break;
}
}
$new_string .= " ".$value;
}
return $new_string;
}
You would need to write a regular expression to extract out the urls.