Problems with PHP, preg_replace & regular expressions - php

I'm trying to run this php command:
preg_replace($regexp, $replace, $text, $maxsingle);
Where the vars are:
$regexp = '/(?!(?:[^<\\[]+[>\\]]|[^>\\]]+<\\/a>))\\b(שלום)\\b/imsU';
$replace = '<a title="$1" href="http://stackoverflow.com">$1</a>';
$text is a long post
$maxsingle = 3;
When the text I'm trying to match (in the above case "שלום") is in English, everything works. However, when the text is Hebrew, it doesn't match anything...
Any ideas how to make Hebrew work with preg_replace?
Thanks.

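Try adding the lowercase u (UTF-8) modifier, e.g. /imsUu. The uppercase U already in your pattern is the unrelated "ungreedy" modifier, so it does not enable UTF-8 handling.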
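A minimal sketch of that suggestion, assuming the source file and the text are both UTF-8. The sample text is made up, and the tag-skipping lookahead from the question is dropped to keep the example short. Explicit lookarounds on \p{L} and \p{N} are used instead of \b so the word-boundary behaviour does not depend on how the PCRE build treats \b for non-ASCII letters.

```php
<?php
// Hypothetical sample text; the original post body is not shown in the question.
$text = 'שלום world, שלום again';

// Lowercase u: pattern and subject are treated as UTF-8.
// The lookarounds reject a match that sits inside a longer word.
$regexp  = '/(?<![\p{L}\p{N}])(שלום)(?![\p{L}\p{N}])/u';
$replace = '<a title="$1" href="http://stackoverflow.com">$1</a>';

// The fourth argument caps the number of replacements, as in the question.
$result = preg_replace($regexp, $replace, $text, 3, $count);

echo $count, "\n";   // number of replacements performed
echo $result, "\n";
```

If preg_replace returns null instead of a string, preg_last_error() can show whether the subject contained invalid UTF-8.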

Related

Regular expression not working as intended when I use an emoji at the beginning of a string

My code is written in PHP. I am trying to store in my database subjects of the emails that I send, only after I remove the emojis that I include in the subject lines of those emails. I created this regular expression:
$cleansubject = preg_replace("/[^a-zA-Z0-9\s]/", "", $subject);
It works when I have the emoji at the end of the string, such as:
But if the emoji is at the beginning of the string, it does not work; the entry is not even stored in my database:
Any issues that you can identify in my regular expression to achieve what I want?
UPDATE 1: Apparently the regular expression is just fine:
Add the "u" modifier to your regular expression to make it treat strings as UTF-8.
$cleansubject = preg_replace("/[^a-zA-Z0-9\s]/u", "", $subject);
Or use a built-in function to strip the non-ASCII characters from your string, e.g. iconv, utf8_decode, mb_convert_encoding, or recode. (Note that utf8_decode is deprecated as of PHP 8.2.)
$cleansubject = trim(iconv('UTF-8', 'ASCII//IGNORE', $subject));
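As a rough illustration of the u modifier approach, assuming the subject is valid UTF-8: the sample subject below is adapted from the question, and the second pattern is a variant (not from the original answers) that keeps punctuation by whitelisting Unicode letters, numbers, and punctuation.

```php
<?php
// "\u{1F5A5}" is the desktop-computer emoji (PHP 7+ escape syntax).
$subject = "\u{1F5A5} Learning Online: Digital Marketing Course";

// Pattern from the answer: keeps only ASCII letters, digits and whitespace.
// Note that this also removes the colon.
$asciiOnly = trim(preg_replace('/[^a-zA-Z0-9\s]/u', '', $subject));

// Variant: keep any letter, number, punctuation or whitespace; drop symbols such as emoji.
$keepPunct = trim(preg_replace('/[^\p{L}\p{N}\p{P}\s]/u', '', $subject));

echo $asciiOnly, "\n"; // Learning Online Digital Marketing Course
echo $keepPunct, "\n"; // Learning Online: Digital Marketing Course
```

trim() removes the leading space that is left behind when the emoji at the start of the string is deleted.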
This could be an encoding problem (3v4l example):
echo utf8_encode('⌨️,🖥,🖨, Learning Online: Digital Marketing Course');
// Output: ⌨ï¸,🖥,🖨, Learning Online: Digital Marketing Course
When you try to match using your pattern, this fails (see here), but if you instead match any number of non-word characters without the global flag (like here), you match the whole emoji.
And using preg_replace() this becomes:
$re = '/\W*/';
$str = 'â¨ï¸,ð¥,ð¨, Learning online: Digital Marketing Course';
$subst = '';
$result = preg_replace($re, $subst, $str, 1);
echo "The result of the substitution is ".$result;
// Output: The result of the substitution is Learning online: Digital Marketing Course

Preg_match for different language URLs

I have some text like this :
$text = "Some thing is there http://example.com/جميع-وظائف-فى-السليمانية
http://www.example.com/جميع-وظائف-فى-السليمانية nothing is there
Check me http://example.com/test/for_me first
testing http://www.example.com/test/for_me the url
Should be test http://www.example.com/翻译-英语教师-中文教师-外贸跟单
simple text";
I need to preg_match the URL, but they are of different languages.
So, I need to get the URL itself, from each line.
I was doing like this :
$text = preg_replace("/[\n]/", " <br>", $text);
$lines = explode("<br>", $text);
foreach($lines as $textLine){
if (preg_match("/(http\:\/\/(.*))/", $textLine, $match )) {
// some code
// Here I need the url
}
}
My current regex is /(http\:\/\/(.*))/. How can I make it work with URLs in different languages?
A regular expression like this may work for you.
In my test it worked with the example text you gave, but it is not very advanced. It simply selects all characters after http:// or https:// until a whitespace character occurs (space, new line, tab, etc.). The g flag shown here is regex101 notation; PHP has no g modifier, so use preg_match_all() to find every match.
/(https?\:\/\/(?:[^\s]+))/gi
Here is a visual example of what would be matched from your sample string:
http://regex101.com/r/bR0yE9
You don't need to work line by line, you can search directly:
if (preg_match_all('~\bhttp://\S+~', $text, $matches))
print_r($matches);
Where \S means "any character that is not whitespace". There is no special internationalisation problem.
Note: if you want to replace all newlines afterwards with <br/>, I suggest using $text = preg_replace('~\R~', '<br/>', $text);, because \R handles several types of newline, whereas \n matches only Unix newlines.
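A short sketch of the preg_match_all approach applied to a trimmed version of the sample text. The u modifier and the optional s in https are additions here, on the assumption that the text is UTF-8 and may contain https links.

```php
<?php
// Shortened copy of the sample text from the question.
$text = "Some thing is there http://example.com/جميع-وظائف-فى-السليمانية
Check me http://example.com/test/for_me first
Should be test http://www.example.com/翻译-英语教师-中文教师-外贸跟单
simple text";

// \S+ consumes everything up to the next whitespace, whatever the script.
if (preg_match_all('~\bhttps?://\S+~u', $text, $matches)) {
    foreach ($matches[0] as $url) {
        echo $url, "\n";
    }
}
```

$matches[0] holds the full matches in the order they appear, so no line-by-line splitting is needed.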

PHP converting plain text to hashtag link

I am trying to convert user's posts (text) into hashtag clickable links, using PHP.
From what I found, hashtags should only contain alpha-numeric characters.
$text = 'Testing#one #two #three.test';
$text = preg_replace('/#([0-9a-zA-Z]+)/i', '#$1', $text);
It places links on all of them (#one #two #three), but I think #one should not be converted, because it directly follows another alphanumeric character. How can I adjust the regex to fix that?
The third one is OK: it matches just #three, which I think is correct.
You could modify your regex to include a negative lookbehind for a non-whitespace character, like so:
(?<!\S)#([0-9a-zA-Z]+)
Working regex example:
http://regex101.com/r/mR4jZ7
PHP:
$text = preg_replace('/(?<!\S)#([0-9a-zA-Z]+)/', '#$1', $text);
Edit:
And to make the expression compatible with other languages (non-English characters), use \p{L}; in PHP this also needs the u modifier so the text is read as UTF-8:
(?<!\S)#([0-9\p{L}]+)
Working example:
https://regex101.com/r/Pquem3/1
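The lookbehind version can be sketched as follows. The link markup in the replacement is hypothetical (the /tag/ path is an assumption; the original replacement markup is not shown), and the #café tag is added to show a non-ASCII letter being accepted.

```php
<?php
$text = 'Testing#one #two #three.test #café';

// (?<!\S): the "#" must be at the start of the string or follow whitespace.
// \p{L} and \p{N} accept letters and digits from any script (requires the u modifier).
$pattern = '/(?<!\S)#([\p{L}\p{N}]+)/u';

// Hypothetical link target, for illustration only.
$linked = preg_replace($pattern, '<a href="/tag/$1">#$1</a>', $text);

echo $linked, "\n";
```

Here #one is left alone because it follows the letter "g", and #three stops at the dot because "." is neither a letter nor a digit.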
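A Unicode-aware pattern that is safe for HTML-encoded text (it skips entities such as &#39;) and also matches hashtags that are joined together: ~(?<!&)#([\pL\d]+)~u
Example text: Here are some tags like #tag1 #tag2#tag3 etc.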
I finally found a solution similar to the hashtag-to-URL handling used by Facebook and others; it may help you too. This code also works with Unicode. I tested it with some Bangla text; let me know whether other languages work as well. I think it will work with any language.
$str = '#Your Text #Unicode #ফ্রিকেলস বা #তিল মেলানিনের #অতিরিক্ত উৎপাদনের জন‍্য হয় যা #সূর্যালোকে #বাড়ে';
$regex = '/(?<!\S)#([0-9a-zA-Z\p{L}\p{M}]+)/mu';
$text = preg_replace($regex, '#$1', $str);
echo $text;
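The \p{M} in that character class matters for scripts that use combining marks. A small sketch, using the first Bengali tag from the example above written as code point escapes (PHP 7+ syntax, an assumption about the PHP version) so the exact characters are unambiguous:

```php
<?php
// "#ফ্রিকেলস": U+09AB, U+09CD (virama, a mark), U+09B0, U+09BF (vowel sign, a mark),
// U+0995, U+09C7 (vowel sign, a mark), U+09B2, U+09B8.
$tag = "#\u{09AB}\u{09CD}\u{09B0}\u{09BF}\u{0995}\u{09C7}\u{09B2}\u{09B8}";

// Letters only: the match stops at the first combining mark.
preg_match('/#(\p{L}+)/u', $tag, $lettersOnly);

// Letters plus marks: the whole word is captured.
preg_match('/#([\p{L}\p{M}]+)/u', $tag, $withMarks);

echo strlen($lettersOnly[1]), "\n"; // 3  (one 3-byte character)
echo strlen($withMarks[1]), "\n";   // 24 (eight 3-byte characters)
```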
To catch the second and third hashtags without the first one, you need to specify that the hashtag should start at the beginning of the line, or be preceded by one or more whitespace characters, as follows:
$text = 'Testing#one #two #three.test';
$text = preg_replace('/(^|\s+)#([0-9a-zA-Z]+)(\b|$)/', '$1#$2', $text);
The \b in the third group defines a word boundary, which allows the pattern to match #three when it is immediately followed by a non-word character.
Edit: MElliott's answer above is more efficient, for the record.

PHP Regular Expression for Bengali Word/Sentence

I am developing a web application using PHP 5.3.x. Everything is working fine, but I am unable to solve a regular expression problem with Bengali text. Following is my code:
$value = '\u09AC\u09BE\u0982\u09B2\u09BE\u09A6\u09C7\u09B6';
$value = mb_convert_encoding($value, 'UTF-8', 'UTF-16BE');
//$value = 'বাংলাদেশ';
//$value = 'Bangladesh';
$pattern = '/^[\p{Bengali}]{0,100}$/';
//$pattern = '/^[\p{Latin}]{0,45}$/';
echo preg_match($pattern, $value);
Whether I pass a Bengali word or not, it always returns false. In a Java EE application I used this regular expression:
\p{InBengali}
But in PHP it is not working! How do I solve this problem?
Maybe this will help you:
The PHP preg functions, which are based on PCRE, support Unicode when the /u option is appended to the regular expression.
From regex in Unicode
Just append the u modifier to the expression, as follows:
$value = 'বাংলাদেশ';
//$pattern = '/^[\p{Bengali}]{0,100}$'; wrong
$pattern = '/^[\p{Bengali}]{0,100}$/u'; //right
echo preg_match($pattern, $value);
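For comparison, here is a small check of the u-modified pattern against both a Bengali and a Latin string. The code point escapes assume PHP 7 or later (the question mentions PHP 5.3, where a UTF-8 string literal would be used instead); they produce UTF-8 directly, so no mb_convert_encoding step is needed.

```php
<?php
// "বাংলাদেশ" written with PHP 7+ code point escapes (double quotes required).
$bengali = "\u{09AC}\u{09BE}\u{0982}\u{09B2}\u{09BE}\u{09A6}\u{09C7}\u{09B6}";
$latin   = 'Bangladesh';

// \p{Bengali} matches characters of the Bengali script; u enables UTF-8 mode.
$pattern = '/^\p{Bengali}{0,100}$/u';

var_dump(preg_match($pattern, $bengali)); // int(1)
var_dump(preg_match($pattern, $latin));   // int(0)
```

preg_match returns 1 for a match and 0 for no match; false is returned only when an error occurs.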
Hopefully this helps others who are facing the same problem.

How to convert URLs containing Unicode characters into clickable links?

I use this function to turn URLs into clickable links, but the problem is that when there is a Unicode character in the URL, only the part before that character becomes a clickable link...
Function:
function clickable($text) {
$text = eregi_replace('(((f|ht){1}tp://)[-a-zA-Z0-9#:%_\+.~#?&//=]+)',
'<a class="und" href="\\1">\\1</a>', $text);
$text = eregi_replace('([[:space:]()[{}])(www.[-a-zA-Z0-9#:%_\+.~#?&//=]+)',
'\\1\\2', $text);
$text = eregi_replace('([_\.0-9a-z-]+#([0-9a-z][0-9a-z-]+\.)+[a-z]{2,3})',
'\\1', $text);
return $text;
}
How to fix this problem?
First of all, don't use eregi_replace. I don't think it's possible to use it with Unicode, and it has been deprecated since PHP 5.3 (the ereg functions were removed in PHP 7.0). Use preg_replace.
Then you can try something like that
preg_replace("/(https?|ftps?|mailto):\/\/([-\w\p{L}\.]+)+(:\d+)?(\/([\w\p{L}\/_\.#]*(\?\S+)?)?)?/u", '$0
EDIT - updated expression to include # character
Try using \p{L} instead of a-zA-Z and \p{Ll} instead of a-z
You can find details of unicode handling in regular expressions here
And get in the habit of using the preg functions rather than the deprecated ereg functions
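One possible preg-based replacement for the URL part of the function, as a sketch only. The function name clickable_sketch is hypothetical, the pattern is deliberately simple (scheme followed by anything up to whitespace, a quote, or an angle bracket), and it assumes the input is UTF-8 plain text that has not already been HTML-escaped.

```php
<?php
// Hypothetical helper; covers only http, https and ftp URLs.
function clickable_sketch(string $text): string
{
    return preg_replace_callback(
        '~\b(?:https?|ftp)://[^\s<>"]+~iu',
        function (array $m): string {
            // Escape the URL before placing it inside HTML.
            $url = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
            return '<a class="und" href="' . $url . '">' . $url . '</a>';
        },
        $text
    );
}

echo clickable_sketch('Visit http://example.com/翻译-英语教师 today'), "\n";
```

Because the character class excludes only whitespace, quotes and angle brackets, trailing punctuation such as a full stop directly after a URL would be included in the link; a stricter pattern would be needed to trim it.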
