We are extracting text from PDF files, and a high proportion of the results contain malformed text: spaces are inserted between the characters of a word, e.g. SEATTLE is returned as S E A T T L E.
Is there a RegEx expression for preg_replace that can remove any spaces in the case of n number of single character "words"? Specifically, remove spaces from any occurrence of a string that is more than 3 single alpha characters and is separated by spaces?
I've googled this for a while, but can't even imagine how to construct the expression. As expressed in a comment, I don't want ALL spaces removed, only those in a run of more than 3 single alpha characters, e.g. Welcome to the Greater S E A T T L E area should become Welcome to the Greater SEATTLE area. The result is to be used in full text searching, so case sensitivity is not a concern.
You may use a simple approach with a preg_replace_callback. Match '~\b[A-Za-z](?: [A-Za-z]){2,}\b~' and str_replace spaces in the anonymous function:
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';
$result = preg_replace_callback($regex, function($m) {
return str_replace(" ", "", $m[0]);
}, $s);
See the regex demo.
To only match sequences of uppercase letters, remove a-z from the pattern:
$regex = '~\b[A-Z](?: [A-Z]){2,}\b~';
One more thing: the separators may be soft/hard spaces, tabs, or other kinds of whitespace. In that case, use
$regex = '~\b[A-Za-z](?:\h[A-Za-z]){2,}\b~u';
(Note the \h in place of the literal space and the added u modifier.)
Finally, to match any Unicode letter, use \p{L} (to only match uppercase ones, \p{Lu}) instead of [a-zA-Z]:
$regex = '~\b\p{L}(?:\h\p{L}){2,}\b~u';
NOTE: It will most probably fail to work in some cases, e.g. when there are one-letter words. You will have to handle those cases separately/manually. Anyway, there is no safe regex-only way to fix OCR issues.
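To make the one-letter-word caveat concrete, here is a minimal sketch. The input sentence and the $joinLetters helper are made up for illustration; the two patterns are the ones shown above.

```php
<?php
// Hypothetical sample input: a real one-letter word ("a") sits next to a spaced-out word.
$s = 'I saw a S I G N';

// Illustrative helper (not part of the answer): join the letters of every match.
$joinLetters = function (string $regex, string $text): string {
    return preg_replace_callback($regex, function ($m) {
        return str_replace(' ', '', $m[0]);
    }, $text);
};

// Mixed-case letter class: the article "a" is swallowed into the run.
$mixed = $joinLetters('~\b[A-Za-z](?: [A-Za-z]){2,}\b~', $s);

// Uppercase-only class: "a" is left alone.
$upper = $joinLetters('~\b[A-Z](?: [A-Z]){2,}\b~', $s);

echo $mixed, "\n"; // I saw aSIGN
echo $upper, "\n"; // I saw a SIGN
```

If the damaged words in your PDFs are mostly uppercase, the stricter class may be the safer default, at the cost of missing lowercase runs.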
Pattern details
\b - a word boundary
[A-Za-z] - a single letter
(?: [A-Za-z]){2,} - 2 or more occurrences of:
    ' ' - a space (\h matches any kind of horizontal whitespace)
    [A-Za-z] - a single letter
\b - a word boundary
When using the u modifier, \h becomes Unicode-aware.
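One thing to watch with the \h variants: str_replace(" ", ...) only removes plain spaces, so the callback needs to strip horizontal whitespace too. A rough sketch, assuming a made-up input in which the letters are separated by a tab and no-break spaces (U+00A0):

```php
<?php
// Sample input (an assumption): letters separated by a tab and no-break spaces.
$s = "Greater S\tE\u{00A0}A\u{00A0}T area";

$result = preg_replace_callback('~\b\p{L}(?:\h\p{L}){2,}\b~u', function ($m) {
    // Strip every kind of horizontal whitespace from the matched run,
    // not just the ASCII space that str_replace(" ", "") would remove.
    return preg_replace('~\h+~u', '', $m[0]);
}, $s);

echo $result, "\n"; // Greater SEAT area
```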
You could do this in one go:
(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)
See live demo here
Explanation:
(?i: # Start of non-capturing group with case-insensitive modifier on
(?<!\S) # Negative lookbehind to ensure there is no leading non-whitespace character
([a-z]) + # Capture one letter and at least one space
((?1)) # Capture one letter in 2nd capturing group
| # Or
\G(?!\A) + # Start match from where previous match ends
# with matching spaces
((?1))\b # Match a letter at word boundary
) # End of non-capturing group
PHP code:
$str = preg_replace('~(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)~', '$1$2$3', $str);
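A quick way to see how this pattern behaves is to run it on the sentence from the question and on a sentence with a genuine one-letter word. The second input is made up for illustration. Since the pattern has no minimum run length, it appears to join any single letter with the letter that follows it:

```php
<?php
$re = '~(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)~';

// The sample sentence from the question.
$spaced = preg_replace($re, '$1$2$3', 'Welcome to the Greater S E A T T L E area');

// Hypothetical input containing a genuine one-letter word.
$oneLetter = preg_replace($re, '$1$2$3', 'I am here');

echo $spaced, "\n";    // Welcome to the Greater SEATTLE area
echo $oneLetter, "\n"; // Iam here
```

So this version is probably best kept for text where stand-alone letters are not expected.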
You may use this pure regex approach with lookarounds and \G:
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';
$repl = preg_replace($re, '$1', $str);
RegEx Demo
RegEx Details:
\b: Match word boundary
(?:: Start non-capture group
(?=(?:\pL\h+){3}\pL\b): Lookahead to assert we have at least 4 single letters separated by 1+ horizontal whitespace characters (three letter-plus-whitespace pairs and a final letter)
|: OR
(?<!^)\G: \G asserts position at the end of the previous match. (?<!^) ensures we don't match start of the string for the first match
): End non-capture group
(\pL): Match a single letter and capture it
\h+: Followed by 1+ horizontal whitespace
(?=\pL\b): Assert that we only have a single letter ahead
In the replacement we use $1, the captured letter to the left of the whitespace.
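A small check of the minimum-length behaviour, as a sketch: the first input is the sentence from the question, the second is a made-up sentence with only three single letters in a row, which the lookahead should leave untouched.

```php
<?php
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';

// Seven single letters in a row: the run is joined.
$long = preg_replace($re, '$1', 'Welcome to the Greater S E A T T L E area');

// Hypothetical input with only three single letters: nothing is changed.
$short = preg_replace($re, '$1', 'Plan A B C today');

echo $long, "\n";  // Welcome to the Greater SEATTLE area
echo $short, "\n"; // Plan A B C today
```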
Related
I need (in PHP) to split a sentence by a word that cannot be the first or the last one in the sentence. Say the word is "pression" and here is my regex
/^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$/i
Live here: https://regex101.com/r/CHAhKj/1/
First, it doesn't match.
Next, I wonder: is it at all possible to split that way? I tried a simplified example
print_r(preg_split('/^.+pizza.+$/', 'my pizza is cool'));
live here http://sandbox.onlinephpfunctions.com/code/10b674900fc1ef44ec79bfaf80e83fe1f4248d02
and it prints an array of 2 empty strings, when I expect
['my ', ' is cool']
I need (in PHP) to split a sentence by the word that cannot be the first or the last one in the sentence
You may use this regex:
(?<=[^\s.?]\h)pression(?=\h[^\s.?])
RegEx Demo
RegEx Details:
(?<=[^\s.?]\h): Lookbehind to assert that immediately before the current position there is a character that is not whitespace, not a dot and not a ?, followed by a horizontal whitespace character.
pression: Match the word pression
(?=\h[^\s.?]): Lookahead to assert that immediately after the current position there is a horizontal whitespace character followed by a character that is not whitespace, not a dot and not a ?.
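Here is a short sketch of the same lookaround idea used with preg_split, with "pizza" swapped in for "pression" so it lines up with the simplified example in the question. The second and third inputs are made up to show the first-word and last-word cases.

```php
<?php
$re = '~(?<=[^\s.?]\h)pizza(?=\h[^\s.?])~';

$middle = preg_split($re, 'my pizza is cool'); // word in the middle: split
$first  = preg_split($re, 'pizza is cool');    // word at the start: no split
$last   = preg_split($re, 'I like pizza');     // word at the end: no split

print_r($middle); // ['my ', ' is cool']
print_r($first);  // ['pizza is cool']
print_r($last);   // ['I like pizza']
```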
First, ^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$ can't match any string at all because the (?=[\s\.\,\:\;])p part requires p to be also either a whitespace char, or a ., ,, : or ;, which invalidates the whole match at once.
Second, the ^.+pizza.+$ pattern does not ensure that the pizza matched is not the first or last word in a sentence, as . matches whitespace, too. It does not return anything meaningful because preg_split uses the match as the delimiter: here the match is the whole string, so the two empty values are what is left 1) before the match, at the start of the string, and 2) after it, at the end.
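The delimiter behaviour is easy to reproduce. A minimal sketch using the string from the question:

```php
<?php
$text = 'my pizza is cool';

// The pattern matches the entire string, so the "delimiter" is everything
// and only the empty strings on either side of it remain.
$wholeMatch = preg_split('/^.+pizza.+$/', $text);

// Splitting on the word alone keeps the text around it.
$wordOnly = preg_split('/pizza/', $text);

print_r($wholeMatch); // ['', '']
print_r($wordOnly);   // ['my ', ' is cool']
```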
That said, all you need is:
preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)
See the regex demo. Details:
^ - start of string
(.*?\w\W+) - Capturing group 1: any zero or more chars, as few as possible, then a word char and then one or more non-word chars
pression - a word
(\W+\w.*) - Capturing group 2: one or more non-word chars, a word char, and then any zero or more chars as many as possible
$ - end of string.
s makes the . match across lines and i flag makes the pattern match in a case insensitive way.
See the PHP demo:
$text = "You can use any regular expression pression inside the lookahead ";
if (preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)) {
echo $m[1] . " << | >> " . $m[2];
}
// => You can use any regular expression << | >> inside the lookahead
I want to split a string only at whitespace that does not have a certain delimiter (: in my case) before it. E.g.:
$string = "Time: 10:40 Request: page.php Action: whatever this is Refer: Facebook";
Then from something like this I want to achieve an array such that:
$array = ["Time: 10:40", "Request: page.php", "Action: whatever this is", "Refer: Facebook"];
I've tried the following so far:
$split = preg_split('/(:){0}\s/', $visit);
But this is still splitting at every occurrence of a whitespace character.
Edit: I think I asked the wrong question, however "whatever this is" should stay as a single string
Edit 2: The bits before the colons are known and stay the same, maybe incorporating those somehow makes the task easier (of not splitting at whitespace characters in strings that should stay together)?
You can use a lookahead in your split regex:
/\h+(?=[A-Z][a-z]*: )/
RegEx Demo
The regex \h+(?=[A-Z][a-z]*: ) matches 1+ horizontal whitespace characters that are followed by a word starting with an uppercase letter, then a colon and a space.
You can do it like this:
$string = "Time: 10:40 Request: page.php Action: whatever this is Refer: Facebook";
$split = preg_split('/\h+(?=[A-Z][a-z]*:)/', $string);
print_r($split); // dd($split) also works where Laravel's helpers are available
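Since the question's second edit says the keys before the colons are known, one possible variation (an assumption on my part, not taken from the answers) is to put those keys into the lookahead. That way a capitalised word with a colon inside a value does not trigger a split. The "Note:" in the sample string below is made up to show this.

```php
<?php
// Known keys, as described in the question's second edit.
$keys = ['Time', 'Request', 'Action', 'Refer'];

// Hypothetical input: the Action value itself contains "Note:".
$string = "Time: 10:40 Request: page.php Action: read Note: twice Refer: Facebook";

// Build an alternation such as Time|Request|Action|Refer, escaping each key.
$alternation = implode('|', array_map(function ($key) {
    return preg_quote($key, '/');
}, $keys));

// Split only on whitespace that is followed by one of the known keys and a colon.
$parts = preg_split('/\h+(?=(?:' . $alternation . '):)/', $string);

print_r($parts);
// ['Time: 10:40', 'Request: page.php', 'Action: read Note: twice', 'Refer: Facebook']
```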
Another option could be to match the key before the colon and its value, and then split on the whitespace that precedes the next key (whitespace followed by non-whitespace chars and a colon):
\S+:\h+.*?(?=\h+\S+:)\K\h+
\S+: Match 1+ non-whitespace chars followed by a literal colon (the colon is part of the pattern)
\h+ Match 1+ times a horizontal whitespace char
.*? Match any char except a newline non greedy
(?=\h+\S+:) Positive lookahead, assert what is on the right is 1+ horizontal whitespace chars, 1+ non whitespace chars and a colon
\K\h+ Forget what was matched using \K and match 1+ horizontal whitespace chars
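For reference, a minimal sketch of the \K pattern used with preg_split on the string from the question. Because \K resets the start of the reported match, only the trailing whitespace acts as the delimiter.

```php
<?php
$string = "Time: 10:40 Request: page.php Action: whatever this is Refer: Facebook";

// Everything before \K is matched but dropped from the reported match,
// so preg_split removes only the whitespace that follows it.
$parts = preg_split('/\S+:\h+.*?(?=\h+\S+:)\K\h+/', $string);

print_r($parts);
// ['Time: 10:40', 'Request: page.php', 'Action: whatever this is', 'Refer: Facebook']
```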
Regex demo | php demo
I have some data that will be one of the following
Word Number Word Number
Word Number Word Word Number
Word Word Number Word Number
Word Word Number Word Word Number
I would like to extract the Word(s) up until the numbers, and the numbers. Here is what I have at the moment (which looks OK to me, but I don't fully understand regex).
preg_match('/([A-Za-z ])([0-9])([A-Za-z ])([0-9])/', $game, $info);
print_r($info);
However, the array is empty. I know I've seen ^ and + and $ used before but I'm not quite sure how to work it into the regex.
In order to match the strings with the format you described, you need
preg_match_all('/^([a-z]+(?:\s+[a-z]+)?)\s+([0-9]+)\s+([a-z]+(?:\s+[a-z]+)?)\s+([0-9]+)$/im', $game, $info);
See the regex demo
IDEONE demo:
$re = '~^([a-z]+(?:\s+[a-z]+)?)\s+([0-9]+)\s+([a-z]+(?:\s+[a-z]+)?)\s+([0-9]+)$~im';
$game = "Word 123 Word 456\nWord 1234 Word Word 3456\nWord Word 3455 Word 4566\nWord Word 4434 Word Word 44332";
preg_match_all($re, $game, $info);
print_r($info);
The regex explanation:
^ - start of string
([a-z]+(?:\s+[a-z]+)?) - Group 1 for Word Word or Word pattern
\s+ - one or more whitespaces
([0-9]+) - Group 2 for Number
\s+ - one or more whitespaces
([a-z]+(?:\s+[a-z]+)?) - Group 3 for Word Word or Word pattern
\s+ - one or more whitespaces
([0-9]+) - Group 4 for Number pattern
$ - end of string
The /i modifier makes the pattern case-insensitive. /m modifier is used for testing only (it makes ^ and $ match start and end of a line, not the whole string).
The [a-z]+(?:\s+[a-z]+)? subpattern means: match one or more letters with [a-z]+ and then match zero or one occurrence of a sequence of one or more whitespaces (\s+) followed by one or more letters ([a-z]+). Thus, this pattern effectively matches 1 or 2 words separated by whitespace.
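For a single line of input, the same pattern can be used with preg_match and without the m modifier. A small sketch follows; the input line and the variable names are invented for illustration.

```php
<?php
// Hypothetical input line; the names and numbers are made up.
$game = 'Manchester United 2 Chelsea 1';
$re = '/^([a-z]+(?:\s+[a-z]+)?)\s+([0-9]+)\s+([a-z]+(?:\s+[a-z]+)?)\s+([0-9]+)$/i';

if (preg_match($re, $game, $info)) {
    // $info[0] is the whole match; groups 1-4 hold the words and the numbers.
    [, $firstWords, $firstNumber, $secondWords, $secondNumber] = $info;
    echo "$firstWords => $firstNumber, $secondWords => $secondNumber\n";
    // Manchester United => 2, Chelsea => 1
}
```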
I need to get all matches in a string where a word begins with # and then contains only alphanumeric 0-9a-z characters. For example, from the string #ww#ee x##vx #ss #aa assadd #sfsd I need to get these pieces:
#ss
#aa
#sfsd
I am trying:
$str = "#ww#ee x##vx #ss #aa assadd #sfsd";
preg_match_all("#(^|\s)\#([0-9a-z]+)(\s+|$)#ui", $str, $matches);
var_dump( $matches );
But this gives only #ss and #sfsd, and skips #aa.
What would be right pattern for this?
You can use the following regex
'~\B(?<!#)#([0-9a-z]+)(?:\s|$)~iu'
See the regex demo and here is an IDEONE demo:
$re = '~\B(?<!#)#([0-9a-z]+)(?:\s|$)~ui';
$str = "#ww#ee x##vx #ss #aa assadd #sfsd";
preg_match_all($re, $str, $matches);
print_r($matches);
The regex explanation:
\B - match a non-word boundary location (that is, everywhere except between ^ and \w, \w and $, \W and \w, \w and \W)
(?<!#) - fail the match if there is a # before the current location
# - a # symbol (does not have to be escaped)
([0-9a-z]+) - Group 1 (since the (...) are not escaped, they capture a subpattern and store it in a special memory slot)
(?:\s|$) - a non-capturing group (only meant to group alternatives) matching a whitespace (\s) or $.
The ui modifiers allow proper handling of Unicode strings (u) and make the pattern case insensitive (i).
Note that \B requires the # to be preceded by a non-word character or the start of the string. But you do not want to match if another # precedes the #wwww-like string. Thus, we have to use the negative lookbehind (?<!#) that restricts the matches even further.
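It may help to see why the attempt in the question skips #aa: the space after #ss is consumed by the trailing (\s+|$), so it is no longer available for the leading (^|\s) of the next match. The sketch below compares that attempt with a lookaround-only variation. The variation is my own assumption, not the pattern from the answer.

```php
<?php
$str = "#ww#ee x##vx #ss #aa assadd #sfsd";

// Attempt from the question: both (^|\s) and (\s+|$) consume whitespace.
preg_match_all("#(^|\s)\#([0-9a-z]+)(\s+|$)#ui", $str, $consuming);

// Variation (an assumption): zero-width checks that the tag is surrounded
// by whitespace or the string edges, so nothing is consumed between tags.
preg_match_all('~(?<!\S)#([0-9a-z]+)(?!\S)~ui', $str, $lookaround);

print_r($consuming[2]);  // ['ss', 'sfsd']
print_r($lookaround[1]); // ['ss', 'aa', 'sfsd']
```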
I need to format uppercase words as bold, but it doesn't work if the phrase contains two spaces.
Is there any way to make the regex match only words that end with a colon?
$str = "BAKA NO TEST: hey";
$str = preg_replace('~[A-Z]{4,}\s[A-Z]\s{2,}(?:\s[A-Z]{4,})?:?~', '<b>$0</b>', $str);
Expected output: <b>BAKA NO TEST:</b> hey
But it returns <b>BAKA</b> NO TEST: hey
The original $str is a multiline text, so there are many lowercase and uppercase words, but I need to change only some.
You can do it like this:
$txt = preg_replace('~[A-Z]+(?:\s[A-Z]+)*:~', '<b>$0</b>', $txt);
Explanations:
[A-Z]+ # uppercase letter one or more times
(?: # open a non capturing group
\s # a whitespace character (space, tab, newline,...)
[A-Z]+ # uppercase letter one or more times
)* # close the group and repeat it zero or more times
: # a literal colon
If you want a more tolerant pattern, you can replace \s with \s+ to allow more than one space between words.
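A short sketch of the difference between \s and \s+ here. The input with a double space is made up for illustration.

```php
<?php
$single = '~[A-Z]+(?:\s[A-Z]+)*:~';  // exactly one whitespace char between words
$multi  = '~[A-Z]+(?:\s+[A-Z]+)*:~'; // one or more whitespace chars between words

$a = preg_replace($single, '<b>$0</b>', 'BAKA NO TEST: hey');
$b = preg_replace($single, '<b>$0</b>', 'BAKA  NO TEST: hey'); // two spaces after BAKA
$c = preg_replace($multi,  '<b>$0</b>', 'BAKA  NO TEST: hey');

echo $a, "\n"; // <b>BAKA NO TEST:</b> hey
echo $b, "\n"; // BAKA  <b>NO TEST:</b> hey
echo $c, "\n"; // <b>BAKA  NO TEST:</b> hey
```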
Unless you have some good reason to use that regexp, try something simpler, like:
/([A-Z\s]+):/
Also, just so you know, you can use an asterisk to match zero or more whitespace characters: \s*
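One thing to be aware of with the simpler character class: since \s is inside the class, the match can start on the whitespace before the first uppercase word. A rough comparison follows. The input string and the '<b>$1:</b>' replacement are assumptions for illustration.

```php
<?php
// Hypothetical input with lowercase text before the uppercase label.
$str = 'hey BAKA NO TEST: hey';

// Character-class version: the leading space becomes part of the match.
$loose = preg_replace('/([A-Z\s]+):/', '<b>$1:</b>', $str);

// Word-by-word version: the match starts at the first uppercase letter.
$strict = preg_replace('~[A-Z]+(?:\s[A-Z]+)*:~', '<b>$0</b>', $str);

echo $loose, "\n";  // hey<b> BAKA NO TEST:</b> hey
echo $strict, "\n"; // hey <b>BAKA NO TEST:</b> hey
```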