Extracting Urdu/Arabic phrases/sentences from a string


I want to extract Urdu phrases out of a user-submitted string in PHP. For this, I tried the following test code:
$pattern = "#([\x{0600}-\x{06FF}]+\s*)+#u";
if (preg_match_all($pattern, $string, $matches, PREG_SET_ORDER)) {
print_r($matches);
} else {
echo 'No matches.';
}
Now if, for example, $string contains
In his books (some of which include دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), Ibn-e-Insha has told amusing stories of his travels.
I get the following output:
Array
(
[0] => Array
(
[0] => دنیا گول ہے
[1] => ہے
)
[1] => Array
(
[0] => آوارہ گرد کی ڈائری
[1] => ڈائری
)
[2] => Array
(
[0] => ابن بطوطہ کے تعاقب میں
[1] => میں
)
)
Even though I get my desired matches (دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), I also get undesired ones (ہے, ڈائری, and میں -- each of which is actually the last word of its phrase). Can anyone please point out how I can avoid the undesired matches?

That's because the capturing group ([\x{0600}-\x{06FF}]+\s*) is matched multiple times, each time overwriting what it matched the previous time. You could get the expected output by simply converting it to a non-capturing group, (?:[\x{0600}-\x{06FF}]+\s*), but here's a more precise alternative:
$pattern = "#(?:[\x{0600}-\x{06FF}]+(?:\s+[\x{0600}-\x{06FF}]+)*)#u";
The first [\x{0600}-\x{06FF}]+ matches the first word, then if there's some whitespace followed by another word, (?:\s+[\x{0600}-\x{06FF}]+)* matches it and any subsequent words. But it doesn't match any whitespace after the last word, which I presume you don't want.
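As a minimal sketch of that pattern in use, the snippet below runs it against a shortened sample sentence. The variable names and the sample text are illustrative only, and the pattern is written in single quotes so that PHP passes the \x{...} escapes to PCRE untouched.

```php
<?php
// Sample text (shortened from the question); the commas are plain ASCII commas.
$string = 'Some of his books include دنیا گول ہے, آوارہ گرد کی ڈائری, and others.';

// One word from the Arabic block, then zero or more "whitespace + word" repeats.
// There are no capturing groups, so each match holds only the full phrase.
$pattern = '#[\x{0600}-\x{06FF}]+(?:\s+[\x{0600}-\x{06FF}]+)*#u';

preg_match_all($pattern, $string, $matches);

// $matches[0] is the list of full matches, one entry per phrase.
$phrases = $matches[0];
print_r($phrases);
```

Note that the range U+0600 to U+06FF also covers Arabic punctuation such as the Arabic comma (U+060C), so two phrases separated by that character instead of an ASCII comma would be handled differently from the sample above.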

Related

How to write regex to find empty space after colon in string with no new line in text format?

I am writing a regex to find the words after a colon in the output of pdftotext.
I am using xpdf to convert a PDF uploaded by the user into text format:
$text1 = (new Pdf('C:\xpdf-tools-win-4.00\bin64\pdftotext.exe'))
->setPdf('path')
->setOptions(['layout', 'layout'])
->text();
$string = $text1;
$regex = '/(?<=: ).+/';
preg_match_all($regex, $string, $matches);
In ->setPdf('path'), path is the path of the uploaded file.
I get the following data:
Full Name: XYZ
Nationality: Indian
Date of Birth: 1/1/1988
Permanent Residence Address:
In the data above you can see that the residence address is empty.
I'm writing a regex to find the words after each colon,
but $matches contains only:
Current O/P:
Array
(
[0] => Array
(
[0] => xyz
[1] => Indian
[2] => 1/1/1988
)
)
It skips the line when there is only whitespace or nothing at all after the colon.
I want the empty value in the result array too.
Expected O/P:
Array
(
[0] => Array
(
[0] => xyz
[1] => Indian
[2] => 1/1/1988
[3] =>
)
)

Note: The OP has changed his question after several answers were given.
This is an answer to the original question.
Here is one solution, using preg_match_all. We can try matching on the following pattern:
(?<=:)[ ]*(\S*(?:[ ]+\S+)*)
This pattern asserts that a colon comes immediately before the match, skips any number of spaces, and then captures zero or more words separated by spaces. We use index 1 of the output array from preg_match_all because we only want what was captured in the first capture group.
$input = "name: xyz\naddress: db,123,eng.\nage:\ngender: male\nother: hello world goodbye";
preg_match_all ("/(?<=:)[ ]*(\S*(?:[ ]+\S+)*)$/m", $input, $array);
print_r($array[1]);
Array
(
[0] => xyz
[1] => db,123,eng.
[2] =>
[3] => male
[4] => hello world goodbye
)
Using capture groups is a good way to go here, because the captured group, in theory, should appear in the output array, even if there is no captured term.

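In your code, $regex = '/\b: \s*'\K[\w-]+/i';, the string literal ends right before \K. There are three single quotes, and the first two delimit the string, so the rest of the pattern falls outside it.
Anyway, you can use a group to capture everything after the colon, even when it is empty:
$regex = '/^.+: ?(.*)$/m'; should work. Note the pattern delimiters and the m flag, which lets ^ and $ match at each line.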
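A rough sketch of the capture-group idea applied to the sample data from the question follows. The $text variable is hard-coded here as an assumption (in practice it would come from pdftotext), and the pattern is one possible variant, not the only one. Because the group is allowed to match nothing, a label with no value still produces an empty entry.

```php
<?php
// Hard-coded stand-in for the pdftotext output shown in the question.
$text = "Full Name: XYZ\nNationality: Indian\nDate of Birth: 1/1/1988\nPermanent Residence Address:";

// Per line (m flag): a label without colons, the colon, optional spaces or tabs,
// then capture whatever is left on the line, which may be nothing.
preg_match_all('/^[^:\n]+:[ \t]*(.*)$/m', $text, $matches);

// $matches[1] holds the captured values, including the empty one.
$values = $matches[1];
print_r($values);
```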

regex repeated asterisk pattern matching

If I do the regex matching
preg_match('/^[*]{2}((?:[^*]|[*][^*]*[*])+?)[*]{2}(?![*]{2})/s', "**A** **B**", $matches);
I get the result I want for $matches:
Array ( [0] => **A** [1] => A )
but I am not sure how to modify the regex to yield the same result in $matches from the input text without the space in the middle, that is, "**A****B**".

It looks like the regex matching
preg_match('/^[*]{2}((?:[^*]|[*][^*]*[*])+?)[*]{2}/s', "**A****B**", $matches);
yields the result I want for $matches:
Array ( [0] => **A** [1] => A )
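A quick check, assuming only the first bold run is wanted: in the original pattern the negative lookahead rejects the first closing ** whenever another ** follows immediately, so the match is pushed further right. Without the lookahead, the lazy group stops at the first closing ** for both inputs. The $results array is just an illustrative holder.

```php
<?php
// Same pattern as above, without the trailing (?![*]{2}) lookahead.
$pattern = '/^[*]{2}((?:[^*]|[*][^*]*[*])+?)[*]{2}/s';

$results = [];
foreach (['**A** **B**', '**A****B**'] as $input) {
    // The lazy +? takes as little as possible before the closing **.
    preg_match($pattern, $input, $matches);
    $results[$input] = $matches;
}
print_r($results);
```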

preg_match_all is only matching one

I am trying to get the value after the dots, and I would like to get all of them (each as their own key/value).
The following is what I am running:
$string = "div.cat.dog#mouse";
preg_match_all("/\.(.+?)(\.|#|$)/", $string, $matches);
and when I do a dump of $matches I am getting this:
Array
(
[0] => Array
(
[0] => .cat.
)
[1] => Array
(
[0] => cat
)
[2] => Array
(
[0] => .
)
)
Item [1] contains only one value. I was expecting it to contain two items in this case, cat and dog. Why isn't dog being picked up by preg_match_all?

Use lookahead:
\.(.+?)(?=\.|#|$)
The problem with your regex is that it consumes a dot on the left-hand side of the match and a dot, a hash, or the end of input on the right-hand side. After the first match, the internal pointer has moved past the dot that follows cat, leaving no dot to start the match for the next word.
(?=\.|#|$) is a positive lookahead: it checks for these characters without consuming them, so the pointer stays at the end of cat, just before the dot that follows it.
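Here is a small sketch of the lookahead version run against the string from the question. The variable $names is introduced only for illustration.

```php
<?php
$string = 'div.cat.dog#mouse';

// The trailing delimiter is only checked by the lookahead, not consumed,
// so the dot after "cat" is still available to start the next match.
preg_match_all('/\.(.+?)(?=\.|#|$)/', $string, $matches);

// Group 1 holds the text after each dot.
$names = $matches[1];
print_r($names);
```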

Regular expression repeat inside a pattern

I have the following text and I would like to preg_match_all what is within the {'s and }'s if it contains only a-zA-Z0-9 and :
some text,{SOMETHING21} {SOMETHI32NG:MORE}some msdf{TEXT:GET:2}sdfssdf sdf sdf
I am trying to match {SOMETHING21} {SOMETHI32NG:MORE} {TEXT:GET:2} there can be several :'s within the tag.
What I currently have is:
preg_match_all('/\{([a-zA-Z0-9\-]+)(\:([a-zA-Z0-9\-]+))*\}/', $from, $matches, PREG_SET_ORDER);
It works as expected for {SOMETHING21} and {SOMETHI32NG:MORE} but for {TEXT:GET:2} it only matches TEXT and 2
So it only matches the first and last word within the tag, and leaves the middle ones out of the $matches array. Is this even possible or should I just match them and then explode on : ?
-- edit --
Well the question isn't if I can get the tags, the question is if I can get them grouped without having to explode the results again. Even though my current regex finds all the results the subpattern does not come back with all the matches in $matches.
I hope the following clears it up a bit more:
\{ // the match has to start with {
([a-zA-Z0-9\-]+) // after the { the match needs one or more alphanumeric characters
(
\: // if we have : it should be followed by one or more alphanumeric characters
([a-zA-Z0-9\-]+) // <---- !! this is what it is about !! even though this subexpression is in parentheses, not all of its matches are put into $matches if more than one is found
)* // zero or more repetitions of the previous subexpression
\} // the match has to end with }

You can't get all the values matched by a repeated capturing group; you only get the last one.
So you have to match the pattern:
preg_match_all('/{([a-z\d-]+(?::[a-z\d-]+)*)}/i', $from, $matches);
and then split each element in $matches[1] on :.
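A minimal sketch of that two-step approach, using the sample text from the question: the regex captures each whole colon-separated list, and explode() then splits it into its parts. The $tags variable is a name chosen here for illustration.

```php
<?php
$from = 'some text,{SOMETHING21} {SOMETHI32NG:MORE}some msdf{TEXT:GET:2}sdfssdf sdf sdf';

// Step 1: capture the full content of each tag in group 1.
preg_match_all('/{([a-z\d-]+(?::[a-z\d-]+)*)}/i', $from, $matches);

// Step 2: split each captured list on ":" to get the individual parts.
$tags = array_map(
    static function (string $inner): array {
        return explode(':', $inner);
    },
    $matches[1]
);
print_r($tags);
```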

I used non-capture groupings to eliminate the inner groups, and just capture the outer complete colon-separated list.
$from = "some text,{SOMETHING21} {SOMETHI32NG:MORE}some msdf{TEXT:GET:2}sdfssdf sdf sdf";
preg_match_all('/\{((?:[a-zA-Z0-9\-]+)(?:\:(?:[a-zA-Z0-9\-]+))*)\}/', $from, $matches, PREG_SET_ORDER);
print_r($matches);
Result:
Array
(
[0] => Array
(
[0] => {SOMETHING21}
[1] => SOMETHING21
)
[1] => Array
(
[0] => {SOMETHI32NG:MORE}
[1] => SOMETHI32NG:MORE
)
[2] => Array
(
[0] => {TEXT:GET:2}
[1] => TEXT:GET:2
)
)

Maybe I didn't understand the requirement, but...
preg_match_all('/{[A-Za-z0-9:-]+}/', $from, $matches, PREG_PATTERN_ORDER);
results in:
Array
(
[0] => Array
(
[0] => {SOMETHING21}
[1] => {SOMETHI32NG:MORE}
[2] => {TEXT:GET:2}
)
)

Get position of all matches in group

Consider the following example:
$target = 'Xa,a,aX';
$pattern = '/X((a),?)*X/';
$matches = array();
preg_match_all($pattern,$target,$matches,PREG_OFFSET_CAPTURE|PREG_PATTERN_ORDER);
var_dump($matches);
It returns only the last 'a' in the series, but I need all of the 'a's.
In particular, I need the position of each 'a' in the string separately, hence PREG_OFFSET_CAPTURE.
The example is much more complex, see the related question: pattern matching an array, not their elements per se
Thanks

The regex X((a),?)*X matches the entire string as a single match, and a repeated group keeps only its last iteration, so only the last ((a),?) is captured.
What you want to match is an a that is preceded by an X at the start of the string, or that is followed by a comma or by an X at the end of the string.
$target = 'Xa,a,aX';
$pattern = '/(?<=^X)a|a(?=X$|,)/';
preg_match_all($pattern, $target, $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => Array
(
[0] => a
[1] => 1
)
[1] => Array
(
[0] => a
[1] => 3
)
[2] => Array
(
[0] => a
[1] => 5
)
)
)

When your regex includes the X delimiters, it matches once: it finds one large match with groups in it. What you want is many matches, each with its own position.
So, in my opinion, the simplest approach is to search for /a/ or /a,?/ without any X. Then $matches[0] will contain every occurrence of 'a'.
If you need only those between the X delimiters, pre-select that part of the string first.
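One way to sketch the pre-selection idea, assuming the part between the X delimiters consists only of a's and commas: capture that part together with its offset, search inside it, and add the base offset back so the positions refer to the original string. The names $outer, $hits, and $positions are illustrative.

```php
<?php
$target = 'Xa,a,aX';
$positions = [];

// Step 1: pre-select the part between the X delimiters, keeping its offset.
if (preg_match('/X([a,]*)X/', $target, $outer, PREG_OFFSET_CAPTURE)) {
    [$inner, $base] = $outer[1];

    // Step 2: find every 'a' inside the pre-selected part.
    preg_match_all('/a/', $inner, $hits, PREG_OFFSET_CAPTURE);

    // Step 3: shift each offset so it is relative to $target again.
    foreach ($hits[0] as [$text, $offset]) {
        $positions[] = $base + $offset;
    }
}
print_r($positions);
```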
