Finding hashtags in Text

Yes, there are plenty of hashtag regexes available here, but none of them suits my needs, and none actually solves the problem.
The Regex should consider the following hashtags as valid:
#validhashtag
#valid_hashtag
#validhashtag_with_space_before_or_after
#valid_hashtag_chars_öÖäÄüÜß
...and the following should not be valid:
ipsum#notvalid //Not valid: Connected to Word
http://google.com/#results //Not valid: Same as above
#not-valid
#not!valid
Allowed Characters should be:
a-z, A-Z, 0-9, öÖäÄüÜß, _
Max length should be 50 characters.
The main problem is the case where the hashtag is "connected" to another piece of text. I don't know how to solve that.
This is what I attempted:
/([\p{Pc}\p{N}\p{L}\p{Mn}]{1,50})/u
That one works pretty well but doesn't handle the "word#hashtag" problem.

I think your original expression is pretty good; we would just anchor it to the start and end of each line, which assumes that every hashtag sits on its own line:
^\s*#([\p{Pc}\p{N}\p{L}\p{Mn}]{1,50})$
Test
$re = '/^\s*#([\p{Pc}\p{N}\p{L}\p{Mn}]{1,50})$/um';
$str = '#validhashtag
#valid_hashtag
 #validhashtag_with_space_before_or_after
#valid_hashtag_chars_öÖäÄüÜß
ipsum#notvalid //Not valid: Connected to Word
http://google.com/#results //Not valid: Same as above
#not-valid
#not!valid';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
Output
array(4) {
[0]=>
array(2) {
[0]=>
string(13) "#validhashtag"
[1]=>
string(12) "validhashtag"
}
[1]=>
array(2) {
[0]=>
string(14) "#valid_hashtag"
[1]=>
string(13) "valid_hashtag"
}
[2]=>
array(2) {
[0]=>
string(41) " #validhashtag_with_space_before_or_after"
[1]=>
string(39) "validhashtag_with_space_before_or_after"
}
[3]=>
array(2) {
[0]=>
string(35) "#valid_hashtag_chars_öÖäÄüÜß"
[1]=>
string(34) "valid_hashtag_chars_öÖäÄüÜß"
}
}
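
One thing worth checking with the anchored pattern is how it treats a hashtag in the middle of a sentence. The following is a minimal sketch using a made-up sample string (not from the question); it suggests that only a hashtag occupying a whole line is found:

```php
<?php
// Pattern from the answer above: anchored to the start and end of each line.
$re = '/^\s*#([\p{Pc}\p{N}\p{L}\p{Mn}]{1,50})$/um';

// Hypothetical sample: one hashtag inside a sentence, one alone on its line.
$str = "I like #php\n#php";

$count = preg_match_all($re, $str, $matches);

echo $count, "\n";          // 1
echo $matches[1][0], "\n";  // php
```

If hashtags can appear inside running text, a pattern based on whitespace boundaries is probably a better fit than line anchors.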

You may use either of the two below:
/(?<!\S)#\w+(?!\S)/u
/(?<!\S)#[\w\p{M}\p{Pc}]+(?!\S)/u
If you want to restrict the length of the word part, keep your {1,50} quantifier: /(?<!\S)#\w{1,50}(?!\S)/u.
Also note: even with the u modifier, \w does not match the same characters that are considered "word" characters in .NET, Java, or Python re regexes. To fill the gap, you may include other classes and use [\w\p{M}\p{Pc}]+ instead of just \w+, where \p{M} matches any diacritic and \p{Pc} matches any connector punctuation.
Details
(?<!\S) - a whitespace or start of string required right before
# - a # sign
\w+ - 1+ word chars (to restrict the length to 1 to 50, replace + with {1,50}; also note that the u modifier lets the PCRE engine match any Unicode letters and digits with the \w shorthand)
[\w\p{M}\p{Pc}]+ - 1+ word chars, diacritics (\p{M}), or connector punctuation (\p{Pc}, considered word chars in .NET regex)
(?!\S) - a whitespace or end of string required right after.
PHP demo:
$s = "#validhashtag
#valid_hashtag
#validhashtag_with_space_before_or_after
#valid_hashtag_chars_öÖäÄüÜß
...and the following should not be valid:
ipsum#notvalid //Not valid: Connected to Word
http://google.com/#results //Not valid: Same as above
#not-valid
#not!valid";
if (preg_match_all('~(?<!\S)#\w+(?!\S)~u', $s, $matches)) {
print_r($matches[0]);
}
Output:
Array
(
[0] => #validhashtag
[1] => #valid_hashtag
[2] => #validhashtag_with_space_before_or_after
[3] => #valid_hashtag_chars_öÖäÄüÜß
)
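
To see how the whitespace lookarounds and the {1,50} limit behave on running text, here is a small sketch. The helper name extract_hashtags and the sample sentence are assumptions made for illustration; they are not part of the answer above.

```php
<?php
// Hypothetical helper: returns all standalone hashtags of 1 to 50 word characters.
function extract_hashtags(string $text): array
{
    // (?<!\S) and (?!\S) require whitespace or a string boundary on both sides.
    preg_match_all('~(?<!\S)#\w{1,50}(?!\S)~u', $text, $matches);
    return $matches[0] ?? [];
}

// Made-up sample mixing valid and invalid candidates.
$text = "Hello #world and foo#bar http://google.com/#results #grüße #not-valid #not!valid";
print_r(extract_hashtags($text)); // #world and #grüße

// A 51-character tag is rejected as a whole rather than truncated to 50.
print_r(extract_hashtags('#' . str_repeat('a', 51))); // empty array
```

Because the trailing lookahead must see whitespace or the end of the string, an over-long or dash-containing tag is dropped entirely instead of being cut at the first invalid character.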

Related

Extract email:pass from string

I have text like in the example below
$text = "rami#gmail.com:Password
This email is from Gmail
email subscription is valid
omar#yahoo.com:password
this email is from yahoo
email subscription is valid ";
I want to retrieve every email:password occurrence in the text, without the rest of the description.
I tried preg_match, but it returned 0 results, and explode returns all of the text including the description.
What I have tried so far: explode, strpos, preg_match.
Any help is greatly appreciated.

You can use a regex to capture the emails and passwords separately.
The pattern captures anything containing a # up to a colon, then captures everything after the colon up to the end of the line, leaving out optional trailing whitespace.
preg_match_all("/(.*#.*):(.*?)\s*\n/", $text, $matches);
$matches = array_combine(["match", "email", "password"], $matches);
var_dump($matches);
Output:
array(3) {
["match"]=>
array(2) {
[0]=>
string(24) "rami#gmail.com:Password
"
[1]=>
string(25) "omar#yahoo.com:password
"
}
["email"]=>
array(2) {
[0]=>
string(14) "rami#gmail.com"
[1]=>
string(14) "omar#yahoo.com"
}
["password"]=>
array(2) {
[0]=>
string(8) "Password"
[1]=>
string(8) "password"
}
}
https://3v4l.org/baeQ0

It's difficult to be confident/precise when dealing with unrealistic input strings, but this pattern extracts (does not validate) the email:password lines for you.
Match from the start of the line, match the known characters and in the negated character classes include whitespace characters to prevent matching the next line. You could use \n instead of \s if you like.
Code: (Demo)
$text = "rami#gmail.com:Password
This email is from Gmail
email subscription is valid
omar#yahoo.com:password
this email is from yahoo
email subscription is valid ";
var_export(preg_match_all('~^[^#\s]+#[^:\s]+:\S+~m', $text, $matches) ? $matches[0]: "none");
Output:
array (
0 => 'rami#gmail.com:Password',
1 => 'omar#yahoo.com:password',
)
...hmm, I guess it is okay to allow spaces in a password, but if so, you cannot logically trim any spaces from the right side of the password. An alternative pattern that allows spaces and also provides separate capture groups could look like this (the linked demo includes a fringe case where the password characters require specific pattern logic to prevent greedy matching in the first capture group):
var_export(preg_match_all('~([^#\s]+#[^:\s]+):(.*)~', $text, $matches, PREG_SET_ORDER) ? $matches: "none");
I am favoring negated character classes [^...] over . (any character dot) because it allows the use of greedy quantifiers -- this affords the pattern greater efficiency (in terms of step count, anyhow).
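
If the goal is a lookup table rather than a list of raw lines, the same negated-character-class idea can be combined with capture groups. This is only a sketch: the variable name $credentials is made up, and it assumes passwords contain no whitespace.

```php
<?php
$text = "rami#gmail.com:Password
This email is from Gmail
email subscription is valid
omar#yahoo.com:password
this email is from yahoo
email subscription is valid ";

// Map each email to its password; both parts are captured separately.
$credentials = [];
if (preg_match_all('~^([^#\s]+#[^:\s]+):(\S+)~m', $text, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $credentials[$m[1]] = $m[2];
    }
}
print_r($credentials);
```

PREG_SET_ORDER groups the captures per match, which keeps the loop body simple.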

Php preg_match issue not working

I am trying to find a php preg_match that can match:
"2-20 to 2-25"
from this text:
user levels 2-20 to 2-25 not ready
I tried
preg_match("/([0-9]+) to ([0-9]+)/", $vars[1] , $matchesto);
but the result is:
"20 to 2"
Any help appreciated.

Your pattern is almost correct; just include the dashes and adjust the capture group:
([-0-9]+ to [-0-9]+)
Example:
https://regex101.com/r/eD6lQ2/1

That's because [0-9]+ matches one or more digits but won't match a hyphen (-).
Try this:
$pattern = '~([0-9]+-[0-9]+) to ([0-9]+-[0-9]+)~';
preg_match($pattern, $vars[1] , $matchesto);
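
A caveat when experimenting with modifiers on this pattern: the U (PCRE_UNGREEDY) modifier makes every quantifier lazy by default, so the final [0-9]+ stops after a single digit. A short sketch comparing both variants on the string from the question (the variable names are made up):

```php
<?php
$str = 'user levels 2-20 to 2-25 not ready';

// Default greedy quantifiers: the last group takes all trailing digits.
preg_match('~([0-9]+-[0-9]+) to ([0-9]+-[0-9]+)~', $str, $greedy);

// With U, quantifiers become lazy and the last group stops after one digit.
preg_match('~([0-9]+-[0-9]+) to ([0-9]+-[0-9]+)~U', $str, $ungreedy);

echo $greedy[0], "\n";   // 2-20 to 2-25
echo $ungreedy[0], "\n"; // 2-20 to 2-2
```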

You can use "\d" to match the digits:
<?php
$str = 'user levels 2-20 to 2-25 not ready';
$matches = array();
preg_match('/(\d+-\d+) to (\d+-\d+)/', $str, $matches);
var_dump($matches);
Output:
array(3) {
[0]=>
string(12) "2-20 to 2-25"
[1]=>
string(4) "2-20"
[2]=>
string(4) "2-25"
}
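
To make the difference from the original attempt concrete, the sketch below runs the question's pattern next to a variant with named groups. The group names (from_prefix, from_level, to_prefix, to_level) are hypothetical labels chosen for illustration; the question does not say what the two numbers mean.

```php
<?php
$str = 'user levels 2-20 to 2-25 not ready';

// The question's pattern: digits only, so the match starts after the hyphen.
preg_match('/([0-9]+) to ([0-9]+)/', $str, $before);
echo $before[0], "\n"; // 20 to 2

// Named groups make the four numbers available individually.
$re = '/(?<from_prefix>\d+)-(?<from_level>\d+) to (?<to_prefix>\d+)-(?<to_level>\d+)/';
if (preg_match($re, $str, $m)) {
    $from = (int) $m['from_level'];
    $to   = (int) $m['to_level'];
    echo $from, ' .. ', $to, "\n"; // 20 .. 25
}
```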

PHP Split String after specific occurrences

I have the following string that I'm trying to split into different variables based on specific occurrences:
Brodel8DARK HORSE COMICS
I'd like my end result to be
$user = Brodel
$index = 8
$publisher = DARK HORSE COMICS
I've tried playing around with some regular expressions, but I'm a novice.
These conditions will always be true:
The user name will change (different number of characters, etc.)
The index will always be an integer but can grow to 3+ digits
The publisher will always be in all caps
Thanks for any help

As long as the publisher doesn't start with a number, this regex should work:
/^([A-Za-z]+)(\d+)([A-Z\s]+)$/
It matches one or more letters, followed by one or more digits, and finally one or more capital letters or whitespace characters.
<?php
$string = 'Brodel8DARK HORSE COMICS';
if(preg_match('/^([A-Za-z]+)(\d+)([A-Z\s]+)$/', $string, $matches) === 1){
var_dump($matches);
}
This outputs:
array(4) {
[0]=>
string(24) "Brodel8DARK HORSE COMICS"
[1]=>
string(6) "Brodel"
[2]=>
string(1) "8"
[3]=>
string(17) "DARK HORSE COMICS"
}
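
To get from the match array to the three variables the question asks for, the pattern can be wrapped in a small function. This is a sketch only; the function name split_record and the second sample string are made up for illustration.

```php
<?php
// Hypothetical helper wrapping the pattern above.
function split_record(string $input): ?array
{
    if (preg_match('/^([A-Za-z]+)(\d+)([A-Z\s]+)$/', $input, $m) !== 1) {
        return null; // input does not follow the user/index/publisher layout
    }
    return ['user' => $m[1], 'index' => (int) $m[2], 'publisher' => $m[3]];
}

var_dump(split_record('Brodel8DARK HORSE COMICS'));
var_dump(split_record('Smith123MARVEL')); // multi-digit index
var_dump(split_record('no digits here')); // NULL
```

Returning null for non-matching input lets the caller decide how to handle malformed records.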

try this:
<?php
$string = 'Brodel8DARK HORSE COMICS';
preg_match("/^([^\d]+)(\d+)([A-Z\s]+)$/", $string, $match);
//print_r($match);
echo $publisher = $match[3];//DARK HORSE COMICS
?>

Matching any amount of words regular expression

I'm trying to capture a line with n-number of words that follow a title sequence in PHP, but I cannot capture anything more than the first word. Here are the contents of the file that I am trying to match:
Name: test
Caption: test test test test
And here is the regular expression code and results...
preg_match_all('/([A-z]+:)\s*(\w+)[\r|\r\n|\n]*/', $contents, $array);
Results:
array(3) {
[0]=> array(2) {
[0]=> string(11) "Name: test "
[1]=> string(14) "Caption: test "
}
[1]=> array(2) {
[0]=> string(5) "Name:"
[1]=> string(8) "Caption:"
}
[2]=> array(2) {
[0]=> string(4) "test"
[1]=> string(4) "test"
}
}
Any help would be greatly appreciated.

Assuming that your input data always looks like your example (title segment, colon, words; all on a single line), this should do it:
preg_match_all('/([A-Za-z]+:)\s*(.*)/', $contents, $array);
This would result in $array[1] matching something like Name:, and then $array[2] would match the rest of the line (you may have to use trim() to strip any leading and/or trailing white space from $array[2]).
If you only want to capture "words" in the second part, I believe you could change the second capture group to something like:
preg_match_all('/([A-Za-z]+:)\s*([\w\s]+)/', $contents, $array);
Note also that you shouldn't use the [A-z] construct, since there are non-alphabetical characters in the ASCII table between the upper case letters and the lower case letters. See the ASCII Table for a character map.
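
As a rough illustration of both points, the sketch below builds a title => value array from the sample file contents and then shows what [A-z] accepts. Two things here are assumptions rather than part of the answer: the colon is moved outside the first capture group so the keys come out clean, and the variable name $fields is made up.

```php
<?php
$contents = "Name: test\nCaption: test test test test\n";

// Collect each "Title: value" line into an associative array.
$fields = [];
if (preg_match_all('/([A-Za-z]+):\s*(.*)/', $contents, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $fields[$m[1]] = trim($m[2]); // trim() drops stray whitespace such as \r
    }
}
print_r($fields);

// [A-z] also covers [, \, ], ^, _ and `, which sit between Z and a in ASCII.
var_dump(preg_match('/^[A-z]$/', '_'));    // int(1)
var_dump(preg_match('/^[A-Za-z]$/', '_')); // int(0)
```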

php - regex pattern

I need to use a regex pattern, but what is the right PHP way to "decode" it? My pattern is similar to BBCode, i.e. ['something'], where 'something' could be any length, but realistically I doubt it will be more than 10 letters/digits. What is the correct PHP syntax to "unscramble" it, i.e.
if ($row->xyz =['something'] ):
do this
else:
do that
endif;
Thanks in advance

A basic regexp to match BBCode style tags would look something like this:
preg_match('/\[[\/]?[A-Za-z0-9]+\]/', $row->xyz)
That will match anything that starts with a "[", ends with a "]", and has one or more alphanumeric characters in the middle (with an optional "/" for an end-tag.) Note it has flaws - for example, if you have a nested "[...]" in a larger "[...]", it will only grab the inner one. (i.e. [foo[bar]] will return only "[bar]".)
Example:
<?php
$regexp = '/\[[\/]?[A-Za-z0-9]+\]/';
$testString = '[i]An italic string with some [b]bold[/b] text.[/i]';
preg_match_all($regexp, $testString, $result);
var_dump($result);
?>
Result:
array(1) {
[0]=> array(4) {
[0]=> string(3) "[i]"
[1]=> string(3) "[b]"
[2]=> string(4) "[/b]"
[3]=> string(4) "[/i]"
}
}
Of course, I'm not sure this is what you actually mean you want to do, but it is what you say you want to do. Are you sure you want to find BBCodes, rather than find strings that are wrapped in them?
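
If the intent is instead to branch on whether a field consists of exactly one bracketed tag, as in the if/else from the question, an anchored variant with a capture group may be closer. This is a sketch under that assumption; the function name extract_tag_name is hypothetical.

```php
<?php
// Hypothetical helper: returns the tag name if $value is exactly one
// [tag] or [/tag], or null otherwise.
function extract_tag_name(string $value): ?string
{
    return preg_match('/^\[\/?([A-Za-z0-9]+)\]$/', $value, $m) === 1 ? $m[1] : null;
}

$value = '[something]'; // stands in for $row->xyz from the question
if (extract_tag_name($value) !== null) {
    echo "do this\n";
} else {
    echo "do that\n";
}
```

The ^ and $ anchors reject strings such as [foo[bar]] or plain text, so only a lone tag takes the first branch.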
