Preg_match for different language URLs - php

I have some text like this :
$text = "Some thing is there http://example.com/جميع-وظائف-فى-السليمانية
http://www.example.com/جميع-وظائف-فى-السليمانية nothing is there
Check me http://example.com/test/for_me first
testing http://www.example.com/test/for_me the url
Should be test http://www.example.com/翻译-英语教师-中文教师-外贸跟单
simple text";
I need to preg_match the URLs, but they are in different languages.
So I need to get the URL itself from each line.
This is what I was doing:
$text = preg_replace("/[\n]/", " <br>", $text);
$lines = explode("<br>", $text);
foreach($lines as $textLine){
if (preg_match("/(http\:\/\/(.*))/", $textLine, $match )) {
// some code
// Here I need the url
}
}
My current regex is /(http\:\/\/(.*))/, please suggest how I can make this compatible with the URLs in different languages?

A regular expression like this may work for you.
It worked in my test with the sample text you gave, although it is not very advanced. It simply selects all characters after http:// or https:// until a whitespace character occurs (space, new line, tab, etc.).
/(https?\:\/\/(?:[^\s]+))/gi
The g flag is regex101 notation; in PHP, drop it and call preg_match_all instead.
Here is a visual example of what would be matched from your sample string:
http://regex101.com/r/bR0yE9

You don't need to work line by line, you can search directly:
if (preg_match_all('~\bhttp://\S+~', $text, $matches))
print_r($matches);
Here \S means "any character that is not whitespace". There is no special internationalization problem.
Note: if you want to replace all newlines with <br/> afterwards, I suggest using $text = preg_replace('~\R~', '<br/>', $text);, because \R handles several types of newlines, whereas \n matches only Unix newlines.
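As a minimal sketch of the approach above, the snippet below collects every URL from a shortened copy of the sample text in one pass. The https? alternative and the u modifier are additions not present in the answer's pattern; they are assumptions that seem reasonable for UTF-8 input.

```php
<?php
// Shortened copy of the question's sample text (UTF-8 encoded).
$text = "Some thing is there http://example.com/جميع-وظائف-فى-السليمانية
Check me http://example.com/test/for_me first
Should be test http://www.example.com/翻译-英语教师-中文教师-外贸跟单
simple text";

// \S+ consumes everything up to the next whitespace character,
// so non-Latin path segments are captured like any other character.
preg_match_all('~\bhttps?://\S+~u', $text, $matches);

// $matches[0] holds the full matches, one entry per URL found.
$urls = $matches[0];
print_r($urls);
```

Because \S only cares about whitespace, no per-language handling is needed; trailing punctuation next to a URL would be captured as well, so trim it afterwards if that matters.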

Related

using regex for filtering some words in persian in php

I'm working on a script that identifies offensive words in text messages. The problem is that users sometimes alter words to make them unidentifiable. My code has to catch those too, as far as possible.
First of all I replace all non-alphanumeric characters with spaces.
And then:
I've written two regex patterns.
One removes repeated characters from the string.
For example, if the user has written seeeeex, it replaces it with sex:
preg_replace('/(.)\1+/', '$1', $text)
This regex works fine for English words but not for Farsi words, which is my case.
For example, if you write:
امیییییییییین
it does nothing to it.
I also tried
mb_ereg_replace
but it didn't work either.
My other regex is meant to remove the spaces around one-letter words.
For example, I want it to convert S E X to sex:
preg_replace('/( [a-zA-Zآ-ی] )\1+/', trim('$1'), $text);
This regex doesn't work at all and needs to be corrected.
Thank you for your help
When working with multi-byte characters, you should enable the Unicode-aware u modifier, which makes tokens such as . match whole characters rather than single bytes. In your first case it should be:
/(.)\1+/u
In your second regex, however, I see both syntax and semantic errors. You could change it to:
/\b(\pL)\s+/u
PHP:
preg_replace('/\b(\pL)\s+/u', '$1', $text);
Putting all together:
$text = 'سسس ککک سسس';
echo preg_replace(['/(.)\1+/u', '/\b(\pL)\s+/u'], '$1', $text); // outputs: سکس
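To illustrate why the u modifier matters here, the following sketch runs the repeat-collapsing pattern on the Farsi word from the question with and without it. The variable names are made up for the example, and the input is assumed to be UTF-8.

```php
<?php
// Build the sample word: two leading letters, nine repeated letters, one final letter.
$word = 'ام' . str_repeat('ی', 9) . 'ن';

// Without /u the pattern works on single bytes. Each of these letters is
// two bytes long, so no byte is ever followed directly by itself.
$bytewise = preg_replace('/(.)\1+/', '$1', $word);

// With /u each "." is one whole UTF-8 character, so the run collapses.
$unicode = preg_replace('/(.)\1+/u', '$1', $word);

var_dump($bytewise === $word); // bool(true): nothing was changed
echo $unicode, "\n";           // امین
```

The same reasoning applies to \b and \pL in the second pattern: they only behave as expected on multi-byte letters when the u modifier is present.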

preg_replace words with #

Trying to use preg_replace to find words with # in them and replace the whole word with nothing.
<?php
$text = "This is a text #removethis little more text";
$textreplaced = preg_replace('/#*. /', '', $text);
echo $textreplaced;
Should output: This is a text little more text
I've been googling special characters and such, but I'm lost.
Use \w:
$textreplaced = preg_replace('/#[\w]+ /', '', $text);
echo $textreplaced;
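A slightly more general sketch of the same idea is shown below. The removeHashtags helper is hypothetical, and using \s* instead of a single space (so that a hashtag at the end of the string is also removed) is an assumption about what is wanted.

```php
<?php
// Hypothetical helper: removes each hashtag plus any whitespace that follows it.
function removeHashtags(string $text): string
{
    // # followed by one or more word characters, then optional whitespace.
    return preg_replace('/#\w+\s*/u', '', $text);
}

echo removeHashtags('This is a text #removethis little more text'), "\n";
// This is a text little more text
```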
You are only finding one character at a time.
I believe you are only matching the '#' to begin with. To match the whole word, end the pattern with \b, so your final regex would be something like /(#).{2,}?\b/.
The ? is important because quantifiers are greedy by default and grab as many characters as possible.
Just a tip: try your patterns in a tester like regexpal.

PHP converting plain text to hashtag link

I am trying to convert user's posts (text) into hashtag clickable links, using PHP.
From what I found, hashtags should only contain alpha-numeric characters.
$text = 'Testing#one #two #three.test';
$text = preg_replace('/#([0-9a-zA-Z]+)/i', '#$1', $text);
It places links on all of them (#one #two #three), but I think #one should not be converted, because it directly follows another alphanumeric character. How can I adjust the regex to fix that?
The 3rd one is also OK, it matches just #three, which I think is correct.
You could modify your regex to include a negative lookbehind for a non-whitespace character, like so:
(?<!\S)#([0-9a-zA-Z]+)
Working regex example:
http://regex101.com/r/mR4jZ7
PHP:
$text = preg_replace('/(?<!\S)#([0-9a-zA-Z]+)/', '#$1', $text);
Edit:
And to make the expression compatible with other languages (non-English characters); in PHP, add the u modifier so that \p{L} matches multi-byte letters:
(?<!\S)#([0-9\p{L}]+)
Working example:
https://regex101.com/r/Pquem3/1
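A minimal sketch of the lookbehind approach in PHP follows. The linkHashtags function and the /tag/ URL scheme are placeholders invented for illustration; they are not taken from the question.

```php
<?php
// Hypothetical helper: wraps stand-alone hashtags in links.
// The /tag/... href is an assumed URL scheme, not one from the question.
function linkHashtags(string $text): string
{
    return preg_replace(
        // (?<!\S) requires start of string or whitespace before the #.
        // \p{L}, \p{M}, \p{N} cover letters, combining marks and digits in any script.
        '/(?<!\S)#([\p{L}\p{M}\p{N}]+)/u',
        '<a href="/tag/$1">#$1</a>',
        $text
    );
}

echo linkHashtags('Testing#one #two #three.test'), "\n";
// Testing#one <a href="/tag/two">#two</a> <a href="/tag/three">#three</a>.test
```

Here #one is left alone because it directly follows a non-whitespace character, and #three stops at the dot because a dot is not a letter, mark or digit.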
A Unicode-aware pattern that is also safe for HTML-encoded text (it skips a # that follows &, as in numeric entities such as &#39;): ~(?<!&)#([\pL\d]+)~u
Sample text: Here are some tags like #tag1 #tag2#tag3 etc.
I finally found a solution similar to the hashtag-to-URL handling on Facebook and other sites; it may help you too. This code also works with Unicode. I tested it with some Bangla text; I think it will work with any language, but let me know how it behaves with others.
$str = '#Your Text #Unicode #ফ্রিকেলস বা #তিল মেলানিনের #অতিরিক্ত উৎপাদনের জন‍্য হয় যা #সূর্যালোকে #বাড়ে';
$regex = '/(?<!\S)#([0-9a-zA-Z\p{L}\p{M}]+)/mu';
$text = preg_replace($regex, '#$1', $str);
echo $text;
To catch the second and third hashtags without the first one, you need to specify that the hashtag must start at the beginning of the line or be preceded by one or more whitespace characters, as follows:
$text = 'Testing#one #two #three.test';
$text = preg_replace('/(^|\s+)#([0-9a-zA-Z]+)(\b|$)/', '$1#$2', $text);
The \b in the third group defines a word boundary, which allows the pattern to match #three when it is immediately followed by a non-word character.
Edit: MElliott's answer above is more efficient, for the record.

Allow space into my Regex

I cannot find a way to allow a space in this regex, which extracts the text between the title tags:
<title>my exemple</title>
Here is the regex:
$pattern = "/<title>(.+)<\/title>/i";
I tried
/<title>(.+)<\/title>/i\s
/<title>(.+)<\/title>/i\S
/<title>\s(.+)<\/title>/i
/<title>(.+)\s<\/title>/i
Here is the full function:
function getSiteTitle(){
    // Fall back to a marker value when no referer was sent.
    $RefURL = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : 'Unknown';
    if($RefURL != 'Unknown'){
        $con = file_get_contents($RefURL) or die("can't open URL referer");
        $pattern = "/<title>(.+)<\/title>/i";
        preg_match($pattern,$con,$match);
        $result = array($match[1],$RefURL);
        return $result;
    }
}
I have verified that I receive a keyword in my referer, because it works pretty well with keywords that contain no spaces.
Thank you
If you want to capture HTML on multiple lines (is that what you mean by "spaces"?), you'll need to turn on the s modifier, which allows the . character to match newline characters, as well.
This should work:
/<title>(.+)<\/title>/is
How about
$pattern = "/<title>\s*(.+?)\s*<\/title>/i";
then the first capturing group will contain only the keyword, which may contain spaces (the lazy .+? leaves the trailing whitespace to \s*), like:
<title> key word </title>
// result is "key word"
add the s modifier to the end (/.../is) if you want to allow newlines inside title as well.
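Putting these suggestions together, a title extractor might look like the sketch below. The extractTitle function is a hypothetical helper; allowing attributes on the title tag and returning null when nothing matches are assumptions, not requirements from the question.

```php
<?php
// Hypothetical helper: returns the trimmed <title> text, or null if none is found.
function extractTitle(string $html): ?string
{
    // i: case-insensitive tag name; s: "." also matches newlines.
    // (.*?) is lazy, so the match stops at the first closing tag.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $match)) {
        return trim($match[1]);
    }
    return null;
}

var_dump(extractTitle("<html><TITLE>\n key word \n</TITLE></html>")); // string(8) "key word"
var_dump(extractTitle('<html><body>no title</body></html>'));         // NULL
```

Checking the return value of preg_match before reading $match[1] avoids an undefined-index notice when the page has no title.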
If I got what you want right, you could also use this approach:
$pattern = "/<title>(.+)<\/title>/is";
and then trim the first capturing group.
Selecting the text between the title tags, and the tags as well:
/<title>(.+)<\/title>/
Doing the same even if they are spread over multiple lines:
/<title>(.+)<\/title>/s
Doing the same as above but ignoring cases (lower or upper case doesn't matter)
/<title>(.+)<\/title>/is
Now we are using lookbehind and lookahead in order to only select the text between the tags:
/(?<=<title>)(.+)(?=<\/title>)/is
Please change the flags (i and s) the way you need them.
If that doesn't solve your problem I don't know what will :)
Here you can see an example of how my last regex would work: http://regexr.com?37ukf
EDIT:
Ok, try to test this code somewhere:
<?php
$title = '<title> My Example </title>';
preg_match('/(?<=<title>)(.+)(?=<\/title>)/is', $title, $match);
var_dump($match);
?>
You'll see that it works perfectly fine. Now, with this knowledge, go ahead and check whether $con truly looks the way you think it should. And do a var_dump of $match instead of looking at specific indices.

Converting links occuring inside a string

I am attempting to change a string occurrence, e.g. http://www.bbc.co.uk/, so that it appears inside an HTML link, e.g. http://www.bbc.co.uk
However, for some reason my regex conversion does not work. Can someone please point me in the right direction?
$text = "I love this website http://www.bbc.co.uk/";
$x = preg_replace("#[a-z]+://[^<>\s]+[[a-z0-9]/]#i", "\\0", $text);
var_dump($x);
outputs I love this website http://www.bbc.co.uk/ (No html link)
Your weird character class is at fault:
[[a-z0-9]/]
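Double square brackets are for POSIX character classes like [[:digit:]]. As written, [[a-z0-9] is a class that also contains a literal [, and the /] after it are literal characters, so the URL would have to end in /].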
You meant to write just:
[a-z0-9/]
It is because your regex is not giving you a match (in fact it's really not even close to giving you a match as you are not accepting periods in the domain name at all). Try something like this:
$pattern = '#https?://.*\b#i';
$replace = '$0';
$x = preg_replace($pattern, $replace, $text);
Note that I am not actually trying to validate the URL format here, so I just accept anything starting with http(s):// up to a word boundary (note that .* is greedy, so the match runs to the last word boundary on the line). It didn't seem as if you were going for a true URL validation regex anyway (i.e. validating that there is at least one ., that the TLD component has 2-6 characters, etc.), so I figured I would give you the simplest pattern that would match.
Use this:
$x = preg_replace('#http://[?=&a-z0-9._/-]+#i', '<a target="_blank" href="$0">$0</a>', $text);
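As a variation on the answers above, the sketch below uses preg_replace_callback so the matched URL can be HTML-escaped before it is placed in the link. The linkify name and the choice to stop at whitespace or angle brackets are assumptions; it does not validate URLs.

```php
<?php
// Hypothetical helper: turns bare http(s) URLs into anchor tags.
function linkify(string $text): string
{
    return preg_replace_callback(
        // Match from the scheme up to the next whitespace or angle bracket.
        '~\bhttps?://[^\s<>]+~iu',
        function (array $match): string {
            // Escape the URL so quotes or ampersands cannot break the markup.
            $url = htmlspecialchars($match[0], ENT_QUOTES, 'UTF-8');
            return '<a href="' . $url . '">' . $url . '</a>';
        },
        $text
    );
}

echo linkify('I love this website http://www.bbc.co.uk/'), "\n";
// I love this website <a href="http://www.bbc.co.uk/">http://www.bbc.co.uk/</a>
```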
