preg_replace / strip only two uppercase letters when together - php

I have an unusual case: some of my sentences contain two uppercase letters together. Like this:
The AA batteries.
The CA power.
The WV chronicles.
etc.
How do I strip just those uppercase letters? Thanks!

Here ya go
preg_match_all("/\b([A-Z]{2})\b/",$string,$matches);
/*
--matches (for $string = "The AA batteries. The CA power. The WV chronicles.")
[0] => Array ( AA, CA, WV ) (full pattern matches)
[1] => Array ( AA, CA, WV ) (capture group 1)
*/
https://regex101.com/r/rT3jK7/1
EDIT - Edited to use preg_match_all instead of preg_match
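Since the question asks about stripping rather than matching, the same pattern can also be handed to preg_replace. This is only a minimal sketch: the sample string is assembled from the sentences in the question, and the trailing \s* (which also removes the whitespace after the letter pair) is my assumption, not part of the answer above.

```php
<?php
// Sample input assembled from the sentences in the question.
$string = 'The AA batteries. The CA power. The WV chronicles.';

// \b[A-Z]{2}\b matches exactly two uppercase ASCII letters standing alone.
// \s* is an assumption: it also removes the whitespace that follows the pair.
$stripped = preg_replace('/\b[A-Z]{2}\b\s*/', '', $string);

echo $stripped, "\n"; // The batteries. The power. The chronicles.
```

Because of the word boundaries, longer uppercase runs such as "USA" are left alone.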

Related

Regex: match ":" and "-" but don't match "I-"

Ohare:Montrose:I-290 Circle:IL:IL
Ohare-Montrose-I_290-Circle-IL-IL
EB:Kennedy Expy:O'Hare:IL-43 (Harlem Ave):IL:IL
NB:I-894/US-45:Hale Interchange:Zoo Interchange:WI:IL
NB
I-894/US-45
Hale
Interchange
Zoo Interchange
WI
IL
WB:Indiana-East-West:Eastpoint:Middlebury:IN:25:IL
WB
Indiana-East-West
Eastpoint
Middlebury
IN
25
IL
I am trying to extract words from two different sources that use different conventions.
I am using regex for that, but I cannot come up with a single regex that deals with both conventions.
If I split on : or - then the first one gets extracted as
Ohare, Montrose, I, 290 Circle, IL, IL
How can I get a regex to split on : or - but ignore 'I-', 'IL-', 'US-', 'Indiana-East-West' and many others that I may find?
What I have so far but not working as I want
Regex
You can use this regex, which relies on the (*SKIP)(*F) backtracking verbs to skip the protected prefixes and only then split on : or -:
(?:(?:IL?|US)-|Indiana-East-West)(*SKIP)(*F)|[:-]
Example Code:
$s = 'NB:I-894/US-45:Hale Interchange:Zoo Interchange:WI:IL';
print_r(preg_split('/(?:(?:IL?|US)-|Indiana-East-West)(*SKIP)(*F)|[:-]/' , $s));
Array
(
[0] => NB
[1] => I-894/US-45
[2] => Hale Interchange
[3] => Zoo Interchange
[4] => WI
[5] => IL
)
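Since the list of protected prefixes may grow, the same (*SKIP)(*F) pattern can be built from an array. The following is a rough sketch: the helper name splitProtected and the $protected list are hypothetical, and it assumes every protected phrase should be skipped wherever it appears.

```php
<?php
// Hypothetical helper: split on ':' or '-' but leave protected phrases intact.
function splitProtected(string $input, array $protected): array
{
    // Escape each phrase so it is matched literally inside the pattern.
    $alternatives = array_map(function ($phrase) {
        return preg_quote($phrase, '/');
    }, $protected);

    // Protected phrases are consumed and discarded via (*SKIP)(*F);
    // only the remaining ':' and '-' characters act as delimiters.
    $pattern = '/(?:' . implode('|', $alternatives) . ')(*SKIP)(*F)|[:-]/';

    return preg_split($pattern, $input);
}

// Assumed list, taken from the phrases named in the question.
$protected = ['I-', 'IL-', 'US-', 'Indiana-East-West'];

$first  = splitProtected('WB:Indiana-East-West:Eastpoint:Middlebury:IN:25:IL', $protected);
$second = splitProtected("EB:Kennedy Expy:O'Hare:IL-43 (Harlem Ave):IL:IL", $protected);

print_r($first);
print_r($second);
```

One caveat with this approach: in a dash-delimited source, a field like IL followed by the - delimiter is also treated as protected, so that convention may need a narrower rule.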

Separating a few things with preg_split

For the life of me, I can't figure out how to write the regex to split this.
Let's say we have the sample text:
15HGH(Whatever)ASD
I would like to break it down into the following groups (numbers, individual letters, and the contents of parentheses):
15
H
G
H
Whatever
A
S
D
It can have any combination of the above such as:
15HGH
12ABCD
ABCD(Whatever)(test)
So far, I have gotten it to break apart either the numbers/letters or just the parenthesis part broken away. For example, in this case:
<?php print_r(preg_split( "/(\(|\))/", "5(Test)(testing)")); ?>
It will give me
Array
(
[0] => 5
[1] => Test
[2] => testing
)
I am not really sure what to put in the regex to match on only numbers and individual characters when combined. Any suggestions?
I don't know whether preg_match_all will satisfy you:
$text = '15HGH(Whatever)ASD';
preg_match_all("/([a-z]+)(?=\))|[0-9]+|([a-z])/i", $text, $out);
echo '<pre>';
print_r($out[0]);
Array
(
[0] => 15
[1] => H
[2] => G
[3] => H
[4] => Whatever
[5] => A
[6] => S
[7] => D
)
I've got this (in my example I wasn't sure how to write the \n in the substitution, but the substitution itself works):
(\d+|\w|\([^)]++\)) There is not much to explain: it first tries to match a number, then a single character, and if neither is there, it tries to match a whole group between parentheses. (The parentheses can't be nested.)
Check this out using preg_match_all():
$string = '15HGH(Whatever)(Whatever)ASD';
preg_match_all('/\(([^\)]+)\)|(\d+)|([a-z])/i', $string, $matches);
$results = array_merge(array_filter($matches[1]),array_filter($matches[2]),array_filter($matches[3]));
print_r($results);
\(([^\)]+)\) --> Matches everything between parentheses
\d+ --> Numbers only
[a-z] --> Single letters only
i --> Case insensitive
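Note that merging the per-group arrays with array_merge lists all parenthesized values first, then the numbers, then the letters, so the original order is lost (and array_filter without a callback also drops a literal "0"). If order matters, one possible variation is to walk the matches in sequence with PREG_SET_ORDER. The function name tokenize is hypothetical; this is just a sketch of the idea.

```php
<?php
// Hypothetical helper: returns numbers, single letters and parenthesized
// contents in the order in which they appear in the input.
function tokenize(string $input): array
{
    // Group 1 captures the text inside parentheses; the other branches
    // (a run of digits, a single letter) have no capture group.
    preg_match_all('/\(([^()]+)\)|\d+|[a-z]/i', $input, $sets, PREG_SET_ORDER);

    $tokens = [];
    foreach ($sets as $set) {
        // $set[1] exists only when the parenthesized branch matched.
        $tokens[] = isset($set[1]) ? $set[1] : $set[0];
    }
    return $tokens;
}

print_r(tokenize('15HGH(Whatever)ASD'));
// 15, H, G, H, Whatever, A, S, D
```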

How to split text into Unicode words with Regular Expression in PHP

I have a web site module which collects some tweets from twitter and splits them as words to put into a database. However, as the tweets usually have Turkish characters [ıöüğşçİÖÜĞŞÇ], my module cannot divide the words correctly.
For example, the phrase Aynı labda çalıştığım is split into Ayn, labda and alıştığım, but it should have been split into Aynı, labda and çalıştığım
Here's my code which does the job:
preg_match_all('/(\A|\b)[A-Z\Ç\Ö\Ş\İ\Ğ\Ü]?[a-z\ç\ö\ş\ı\ğ\ü]+(\Z|\b)/u', $text,$a);
What do you think is wrong here?
Important note: I know I could simply split the text on the space character, but I need exactly these characters to match. I don't want any numerals or special characters such as [,.!##$^&*123456780].
I need a regular expression that will split this
kısa isimleri ile "Vic" ve "Wick" vardı.
into this:
kısa
isimleri
ile
Vic
ve
Wick
vardı
More examples:
We're #test would be
We
re
test
Föö bär, we're #test to0 ÅÄÖ - 123 ok? kthxbai? is split into this,
b
r
we
re
test
ok
kthxbai
but I want it to be:
Föö
bär
we
re
test
ÅÄÖ
ok
kthxbai
I would take a look at mb_split().
$str = 'We\'re #test Aynı labda çalıştığım';
var_dump(\mb_split('\s', $str));
Gives me:
array
0 => string 'We're' (length=5)
1 => string '#test' (length=5)
2 => string 'Aynı' (length=5)
3 => string 'labda' (length=5)
4 => string 'çalıştığım' (length=16)
This expression would give you the desired result (according to your examples):
/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u
\pL matches any Unicode letter and \pN any Unicode number. The lookarounds make sure a match is neither preceded nor followed by a letter or a number, which completely excludes words containing any digits.
Example:
$str = "Aynı, labda - çalıştığım? \"quote\". Föö bär, we're #test to0 ÅÄÖ - 123 ok? kthxbai?";
preg_match_all('/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u', $str, $m);
print_r($m);
Output:
Array
(
[0] => Array
(
[0] => Aynı
[1] => labda
[2] => çalıştığım
[3] => quote
[4] => Föö
[5] => bär
[6] => we
[7] => re
[8] => test
[9] => ÅÄÖ
[10] => ok
[11] => kthxbai
)
)
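For completeness, here is the same pattern applied to the quoted sentence from the question, wrapped in a small function. The name extractWords is hypothetical, and the sketch assumes both the source file and the input are UTF-8 encoded.

```php
<?php
// Hypothetical helper around the lookaround pattern; assumes UTF-8 input.
function extractWords(string $text): array
{
    // Runs of Unicode letters that are not adjacent to a letter or a number.
    preg_match_all('/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u', $text, $m);
    return $m[0];
}

print_r(extractWords('kısa isimleri ile "Vic" ve "Wick" vardı.'));
// kısa, isimleri, ile, Vic, ve, Wick, vardı
```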
Just match any run of non-space characters placed between word boundaries.
preg_match_all('/\b(\S+)\b/', $text, $a);
This way, it doesn't matter what characters are inside, as long as it's not a space, it'll match it.

Difficulty with preg_match

I'm having some difficulty with preg_match. I'm trying to match roman numerals, like this:
$string='This is roman XI and some other ones: XMCIII, like this.XXVIII'."\n";
preg_match('/(\s|\.)M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\s/',$string,$matches);
print_r($matches);
It should match any roman numeral preceded with whitespace or period and ending with whitespace. But it returns the following:
Array
(
[0] => XI
[1] =>
[2] =>
[3] => X
[4] => I
)
You have {0,4} and {0,3} ranges in the regex, which means that those parts are optional. You get spaces because space[nothing]space becomes a valid match.
You can simply filter the empty results out of your array using array_filter.
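A different way around the empty matches (this is my own variation, not the array_filter approach above) is to require at least one numeral character with a lookahead and to move the surrounding whitespace or period into lookarounds. A minimal sketch, assuming the sample string from the question:

```php
<?php
// Sample input from the question.
$string = 'This is roman XI and some other ones: XMCIII, like this.XXVIII' . "\n";

// (?<=[\s.])       the numeral must be preceded by whitespace or a period
// (?=[MDCLXVI])    at least one numeral character must follow, so the
//                  all-optional body can no longer match nothing
// (?=\s)           the numeral must be followed by whitespace
$pattern = '/(?<=[\s.])(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})(?=\s)/';

preg_match_all($pattern, $string, $matches);
print_r($matches[0]); // XI and XXVIII
```

XMCIII is skipped because it is not a well-formed numeral under this pattern and is followed by a comma rather than whitespace.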

Break up a string by words and punctuation

To split up a string, I came up with...
<?php
preg_match_all('/(\w)|(,.!?;)/', "I'm a little teapot, short and stout.", $matches);
print_r($matches[0]);
I thought this would separate each word (\w) and the specified punctuation (,.!?;).
For example: ["I'm", "a", "little", "teapot", ",", "short", "and", "stout", "."]
Instead I get:
Array
(
[0] => I
[1] => m
[2] => a
[3] => l
[4] => i
[5] => t
[6] => t
[7] => l
[8] => e
[9] => t
[10] => e
[11] => a
[12] => p
[13] => o
etc...
What am I doing wrong here?
Thanks in advance.
You have two faults:
The \w matches only a single character. You want to match multiple by \w+. Furthermore \w matches only alphanumeric characters. If you want to match other characters like ' you will need to include them: [\w'].
The (,.!?;) matches the character sequence ,.!?;. Instead you want to match any of these characters using [,.!?;].
The correct regex is:
'/[\w\']+|[,.!?;]/'
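If you want to be more permissive you should use Unicode character classes instead (this allows letters, numbers, combining marks, dash characters and the apostrophe for words, and any punctuation for punctuation). Note that the two-letter property needs braces, \p{Pd}, because \pPd would be read as \pP followed by a literal d:
'/[\pL\pN\pM\p{Pd}\']+|\pP/u'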
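As a quick check, the corrected ASCII pattern can be run against the sentence from the question. This is only a small sketch using the question's sample input:

```php
<?php
// Sample sentence from the question.
$sentence = "I'm a little teapot, short and stout.";

// [\w']+    one or more word characters or apostrophes (a "word")
// [,.!?;]   exactly one of the listed punctuation characters
preg_match_all('/[\w\']+|[,.!?;]/', $sentence, $matches);

print_r($matches[0]);
// I'm, a, little, teapot, ",", short, and, stout, "."
```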
Try this - sure it works as you want:
([\w]+)|[,.!?;]+
Also want to share with you one very useful service - online regex tester
You may want to try something like:
/([^,.!?; ]+)|([,.!?;])/
