It's a basic preg_replace that detects phone numbers (and just long numbers). My problem is I want to avoid detecting numbers between double "", single '' and forward slashes //
$text = preg_replace("/(\+?[\d-\(\)\s]{8,25}[0-9]?\d)/", "<strong>$1</strong>", $text);
I poked around but nothing is working for me. Your help will be appreciated.
I predict that your pattern is going to let you down more than it is going to satisfy you (or you are very comfortable with "over-matching" within the scope of your project).
While my suggestion really blows out the pattern length, a (*SKIP)(*FAIL) technique will serve you well enough by consuming and discarding the substrings that require disqualification. There may be a way of dictating the pattern logic with lookaround instead, but with an initial pattern with so many potential holes in it and no sample data, there are just too many variables to make a confident suggestion.
Regex101 Demo
Code: (Demo)
$text = <<<TEXT
A number 555555555 then some more text and a quoted number "(123)4567890" and
then 1 2 3 4 6 (54) 3 -2 and forward slashed /+--------0/ versus
+--------0 then something more realistic '234 588 9191' no more text.
This is not closed by the same character on both
ends: "+012345678901/ which of course is a _necessary_ check?
TEXT;
echo preg_replace(
'~([\'"/])\+?[\d()\s-]{8,25}\d{1,2}\1(*SKIP)(*FAIL)|((?!\s)\+?[\d()\s-]{8,25}\d{1,2})~',
"<strong>$2</strong>",
$text);
Output:
A number <strong>555555555</strong> then some more text and a quoted number "(123)4567890" and
then <strong>1 2 3 4 6 (54) 3 -2</strong> and forward slashed /+--------0/ versus
<strong>+--------0</strong> then something more realistic '234 588 9191' no more text.
This is not closed by the same character on both
ends: "<strong>+012345678901</strong>/ which of course is a _necessary_ check?
For the technical breakdown, see the Regex101 link.
Otherwise, this is effectively checking for "phone numbers" (by your initial pattern) and if they are wrapped by ', ", or / then the match is ignored and the regex engine continues looking for matches AFTER that substring. I have added (?!\s) at the start of the second usage of your phone pattern so that leading spaces are omitted from the replacement.
It seems that you're not validating, then you might be trying to write some expression with less boundaries, such as:
^\+?[0-9()\s-]{8,25}[0-9]$
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
Related
I use following code to replace a number and string to a replacement text
var rule = (\d+\s((apple\b|apples\b|Apple\b|Apples\b)+))
var search_regexp = new RegExp(rule, "ig");
return masterstring.replace(search_regexp,replacetext);
input string : 10 apples are better than 100 pears
replacement: 10 Oranges
Result: 10 Oranges are better than 100 pears
How is it possible to have a regular expression for handling 10 apples and Ten apples? Say one to identify
(a number in digits or word)+space+(a case insensitive word)
and replace this with 10 Oranges both using jQuery and php?
If you specifically only want to match valid number 'words' you would have to literally include in your regex all the numbers you want to include.
(one|two|three|four|five|six|seven|eight|nine|ten) etc.
This could be improved by combining words that start with the same letter:
(one|t(wo|hree|en)|f(our|ive)|s(ix|even)|eight|nine)
You can then include your \d+ as your first option:
(\d+|one|t(wo|hree|en)|f(our|ive)|s(ix|even)|eight|nine)
As some said in the comments you are using the case insensitive modifier, so I have done all lower case)
Note that if you want to go beyond ten this will become quite long, and hard to make efficient, I've had a quick go, and created a beast of a regex, I have not tried to optimise too much..
(?:
\d+
|t(?:en|hirteen)
|eleven
|twelve
|fifteen
|(?:
(?:twenty|thirty|fourty|fifty|sixty|seventy|eighty|ninety)
(?:[ -](?:one|t(?:wo|hree)|f(?:our|ive)|s(?:ix|even)|eight|nine))?
)
|(?:one|t(?:wo|hree)|f(?:our(?:teen)?|ive)|s(?:ix|even)(?:teen)?|eight(?:een)?|nine(?:teen)?)
)[ ]apples?
I have spread this over several lines and added the 'x' modifier in the online example - this makes it much easier to read, this works in PHP but not in javascript, you would have to remove the newlines/whitespace to use in JS)
[https://regex101.com/r/zDYme7/1](See working example online here)
Its also worth mentioning that doing this in regex may not be the best way - a string tokenizer would involve a lot less cpu time, but would involve more code.
One example of a tokenizer: https://www.npmjs.com/package/tokenize-text
I am trying to detect with regex, strings that have a pattern of {any_number}{x-}{large|medium|small} for a site with clothing I am building in PHP.
I have managed to match the sizes against a preconfigured set of strings by using:
$searchFor = '7x-large';
$regex = '/\b'.$searchFor.'\b/';
//Basically, it's finding the letters
//surrounded by a word-boundary (the \b bits).
//So, to find the position:
preg_match($regex, $opt_name, $match, PREG_OFFSET_CAPTURE);
I even managed to detect weird sizes like 41 1/2 with regex, but I am not an expert and I am having a hard time on this.
I have come up with
preg_match("/^(?<![\/\d])([xX\-])(large|medium|small)$/", '7x-large', $match);
but it won't work.
Could you pinpoint what I am doing wrong?
It sounds like you also want to match half sizes. You can use something like this:
$theregex = '~(?i)^\d+(?:\.5)?x-(?:large|medium|small)$~';
if (preg_match($theregex, $yourstring,$m)) {
// Yes! It matches!
// the match is $m[0]
}
else { // nah, no luck...
}
Note that the (?i) makes it case-insensitive.
This also assumes you are validating that an entire string conforms to the pattern. If you want to find the pattern as a substring of a larger string, remove the ^ and $ anchors:
$theregex = '~(?i)\d+(?:\.5)?x-(?:large|medium|small)~';
Look at the specification you have and build it up piece by piece. You want "{any_number}{x-}{large|medium|small}".
"{any_number}" would be \d+. This does not allow fractional numbers such as 12.34, but the question does not specify whether they are required.
"{x-}" is a simple string x-
"{large|medium|small}" is a choice between three alternatives large|medium|small.
Joining the pieces together gives \d+x-(large|medium|small). Note the brackets around the alternation, without then the expression would be interpreted as (\d+x-large)|medium|small.
You mention "weird sizes like 41 1/2" but without specifying how "weird" the number to be matched are. You need a precise specification of what you include in "weird" before you can extend the regular expression.
there have a long articles, I want only remove thousand separator, not a comma.
$str = "Last month's income is 1,022 yuan, not too bad.";
//=>Last month's income is 1022 yuan, not too bad.
preg_replace('#(\d)\,(\d)#i','???',$str);
How to write the regex patterns? Thanks
If the simplified rule "Match any comma that lies directly between digits" is good enough for you, then
preg_replace('/(?<=\d),(?=\d)/','',$str);
should do.
You could improve it by making sure that exactly three digits follow:
preg_replace('/(?<=\d),(?=\d{3}\b)/','',$str);
If you have a look at the preg_replace documentation you can see that you can write captures back in the replacement string using $n:
preg_replace('#(\d),(\d)#','$1$2',$str);
Note that there is no need to escape the comma, or to use i (as there are not letters in the pattern).
An alternative (and probably more efficient) way is to use lookarounds. These are not included in the match, so they don't have to written back:
preg_replace('#(?<=\d),(?=\d)#','',$str);
The first (\d) is represented by $1, the second (\d) by $2. Therefore the solution is to use something like this:
preg_replace('#(\d)\,(\d)#','$1$2',$str);
Actually it would be better to have 3 numbers behind the comma to avoid causing havoc in lists of numbers:
preg_replace('#(\d)\,(\d{3})#','$1$2',$str);
I found some partial help but cannot seem to fully accomplish what I need. I need to be able to do the following:
I need an regular expression to replace any 1 to 3 character words between two words that are longer than 3 characters with a match any expression:
For example:
walk to the beach ==> walk(.*)beach
If the 1 to 3 character word is not preceded by a word that's longer than 3 characters then I want to translate that 1 to 3 letter word to '<word> ?'
For example:
on the beach ==> on ?the ?beach
The simpler the rule the better (of course, if there's an alternative more complicated version that's more performant then I'll take that as well as I eventually anticipate heavy usage eventually).
This will be used in a PHP context most likely with preg_replace. Thus, if you can put it in that context then even better!
By the way so far I have got the following:
$string = preg_replace('/\s+/', '(.*)', $string);
$string = preg_replace('/\b(\w{1,3})(\.*)\b/', '${1} ?', $string);
but that results in:
walk to the beach ==> 'walk(.*)to ?beach'
which is not what I want. 'on the beach' seems to translate correctly.
I think you will need two replacements for that. Let's start with the first requirement:
$str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
Of course, you need to replace those \w (which match letters, digits and underscores) with a character class of what you actually want to treat as a word character.
The second one is a bit tougher, because matches cannot overlap and lookbehinds cannot be of variable length. So we have to run this multiple times in a loop:
do
{
$str = preg_replace('/^\w{0,3}(?: \w{0,3})* (?!\?)/', '$0?', $str, -1, $count);
} while($count);
Here we match everything from the beginning of the string, as long as it's only up-to-3-letter words separated by spaces, plus one trailing space (only if it is not already followed by a ?). Then we put all of that back in place, and append a ?.
Update:
After all the talk in the comments, here is an updated solution.
After running the first line, we can assume that the only less-than-3-letter words left will be at the beginning or at the end of the string. All others will have been collapsed to (.*). Since you want to append all spaces between those with ?, you do not even need a loop (in fact these are the only spaces left):
$str = preg_replace('/ /', ' ?', $str);
(Do this right after my first line of code.)
This would give the following two results (in combination with the first line):
let us walk on the beach now go => let ?us ?walk(.*)beach ?now ?go
let us walk on the beach there now go => let ?us ?walk(.*)beach(.*)there ?now ?go
I have just started learning to code both PHP as well as HTML and had a look at a few tutorials on regular expressions however have a hard time understanding what these mean. I appreciate any help.
For example, I would like to validate the email address peanuts#monkey.com. I start off with the code and I get the message invalid email address.
What am I doing wrong?
I know that the metacharacters such as ^ denote the start of a string and $ denote the end of a string however what does this mean? What is the start of a string and what is the end of a string?
When do I group regular expressions?
$emailaddress = 'peanuts#monkey.com';
if(preg_match('/^[a-zA-z0-9]+#[a-zA-z0-9]+\.[a-zA-z0-9]$/', $emailaddress)) {
echo 'Great, you have a valid email address';
} else {
echo 'boo hoo, you have an invalid email address';
}
What you have written works with some small modifications if that is what you want to use, however you miss a '+' at the end.
1)
^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.[a-zA-Z0-9]+$
The caret and dollar character match positions rather than characters, ^ is equal to the beginning of line and $ is equal to the end of line, they are used to anchor your regex. If you write your regex without those two you will match email addresses everywhere in your text, not only the email addresses which is on a single line in this case. If you had written only the ^ (caret) you would have found every email address which is on the start of the line and if you had written only the $ (dollar) you would have found only the email addresses on the end of the line.
Blah blah blah someEmail#email.com
blah blah
would not give you a match because you do NOT have a email address at the beginning of line and the line does not terminate with it either so in order to match it in this context you would have to drop ^ and $.
Grouping is used for two reasons as far I know: Back referencing and... grouping. Grouping is used for the same reasons as in math, 1 + 3 * 4 is not the same as (1 + 3) * 4. You use parentheses to constrain quantifiers such as '+', '*' and '?' as well as alternation '|' etc.
You also parentheses for back referencing, but since I can't explain it better I would link you to: http://www.regular-expressions.info/brackets.html
I will encourage you to take a look at this book, even though you only read the first 2-3 chapters you will learn a lot and it is a great book! http://oreilly.com/catalog/9781565922570
And as the commentators say, this regex is not perfect but it works and show you what you had forgotten. You were not far away!
UPDATED as requested:
The '+', '*' and '?' are quantifiers. And is also a good example where you group.
'+' mean match whatever charachter preceeds it or group 1 or n times.
'*' mean match whatever charachter preceeds it 0 or n times.
'?' mean match whatever charachter preceeds it or the group 0 or 1 time.
n times meaning (indefinitely)
The reason why you use [a-zA-Z0-9]+ is without the '+' it will only match one character. With the + it will match many but it must match at least one. With * it match many but also 0, and ? will match 1 character at most but also 0.
Your regex doesn't match email addresses. Try this one:
/\b[\w\.-]+#[\w\.-]+\.\w{2,4}\b/
I recommend you read through this tutorial to learn about Regular Expressions.
Also, RegExr is great for testing them out.
As for your second question; the ^ character means that the regular expression must start matching from the first character in the string you input. The $ means that the regular expression must end at the final character in the string you input. In essence, this means that your regular expression will match the following string:
peanuts#monkey.com
but NOT the following string:
My email address is peanuts#monkey.com, and I love it!
Grouping regular expressions has lots of use cases. Using matching groups will also make your expression cleaner and more readable. It's all explained quite well in the tutorial I linked earlier.
As CanSpice points out, matching all possible email addresses isn't all that easy. Using the RFC2822 Email Validation expression will do a better job:
/[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/
There are many alternatives, but even the simplest ones will do a fair job as most email addresses end in .com (or other 2-4 character length top domains).
The only reason your original expression doesn't work is that you're limiting the number of characters behind the period (.) in your expressions to 1. Changing your expression to:
/^[a-zA-z0-9]+#[a-zA-z0-9]+\.[a-zA-z0-9]+$/
Will allow for an infinite amount of characters behind the last period.
/^[a-zA-z0-9]+#[a-zA-z0-9]+\.[a-zA-z0-9]{2,4}$/
Will allow 2 to 4 characters behind the last period. That would match:
name#email.com
name#email.info
but not:
fake#address.suckers
The top level domain (".com," ".net," ".museum") can be from 2 to 6 characters. So you should be saying 2,6 instead of 2,4.
I wrote an extremely good email address regular expression a few years ago:
^\w+([-+._]\w+)#(\w+((-+)|.))\w{1,63}.[a-zA-Z]{2,6}$
A lot of research went into that. But I have some basic tips:
DON'T JUST COPY-PASTE! If someone says "here's a great regex for that," don't just copy paste it! Understand what's going on! Regular expressions are not that hard. And once you learn them well, it'll pay dividends forever. I got good at them by taking a class in Perl back in college. Since then, I've barely gotten any better and am WAY better than the vast majority of programmers I know. It's sad. Anyways, learn it!
Start small. Instead of building a giant regex and testing it when you're done, test just a few characters. For example, when writing an email validator, why not try \w+#\w+.\w+ and see how good that is? Add in a few more things and re-test. Like ^\w+#\w+.[A-Za-z]{2,6}$
The start and end of a regex string means that nothing can come before or after the characters you specify. Your regex string needs to account for underscores, needs capitals Zs with your capital ranges, and other adjustments.
/^[a-zA-Z_0-9]+#[a-zA-Z0-9]+\.[a-zA-z0-9]{2,4}$/
{2,4} says the top level domain is between 2 and 4 characters.
This will validate ANY email address (at least i've tried a lot )
preg_match("/^[a-z0-9._-]{2,}+\#[a-z0-9_-]{2,}+\.([a-z0-9-]{2,4}|[a-z0-9-]{2,}+\.[a-z0-9-]{2,4})$/i", $emailaddress);
Hope it works!
Make sure you ALWAYS escape metacharacters (like dot):
if(preg_match('/^[a-zA-z0-9]+#[a-zA-z0-9]+\.[a-zA-z0-9]$/', $emailaddress)) {