Split string with preg_split on English (and non-English) letters - PHP

I want to separate my sentence(s) into two parts, because they are made of English and non-English letters. I have a regex that I use with preg_split to match normal letters and characters. However, it does the opposite of what I want: I am left with only the Japanese and not the English.
String I work with:
すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.
My attempt:
$parts = preg_split("/[ -~]+$/", $cleanline); // $cleanline is the string above
print_r($parts);
My result
Array ( [0] => すぐに諦めて昼寝をするかも知れない。 [1] => )
As you can see, I get an empty second value. How can I get the English and the non-English text into two different strings? Why is the English text not returned even though, as far as I have tested, my regex is correct?

You could use lookarounds to split on the boundary between non-alphabetic characters and alphabetic characters or horizontal whitespace; the limit of 2 stops after the first split:
$str = 'すぐに諦めて昼寝をするかも知れない。 I may give up soon and just nap instead.';
$parts = preg_split("/(?<=[^a-z])(?=[a-z\h])|(?<=[a-z\h])(?=[^a-z])/i", $str, 2);
print_r($parts);
Output:
Array
(
[0] => すぐに諦めて昼寝をするかも知れない。
[1] => I may give up soon and just nap instead.
)

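Try mb_split instead of preg_split. Unlike the preg_* functions, mb_split takes the pattern without delimiters:
mb_regex_encoding('UTF-8');
mb_internal_encoding('UTF-8');
// No surrounding slashes here. Like preg_split, mb_split drops the text the pattern matches.
$parts = mb_split('[ -~]+$', $cleanline);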

If you have two spaces between the two strings, as in your example, you can split on them with a simple \s{2}:
<?php
$s = "すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead."; // two spaces between the sentences
$s = preg_split("/\s{2}/", $s);
print_r($s);
?>
Output:
Array
(
[0] => すぐに諦めて昼寝をするかも知れない。
[1] => I may give up soon and just nap instead.
)
Demo: http://ideone.com/uD2W1Q
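As for why the original attempt leaves the second element empty: preg_split treats whatever the pattern matches as the delimiter and discards it, and [ -~]+$ matches the entire English sentence. A minimal sketch that captures both parts with preg_match instead, assuming the English text is a run of printable ASCII at the end of the line (the variable names $japanese and $english are only illustrative):

```php
<?php
$cleanline = 'すぐに諦めて昼寝をするかも知れない。 I may give up soon and just nap instead.';

// Group 1: everything before the trailing ASCII run (lazy, so it stops as early as possible).
// Group 2: the trailing run of printable ASCII characters (space through tilde).
// The \s* between them swallows the separating whitespace; /u treats the input as UTF-8.
if (preg_match('/^(.*?)\s*([ -~]+)$/u', $cleanline, $m)) {
    $japanese = $m[1];
    $english  = $m[2];
    echo $japanese, "\n", $english, "\n";
}
```

This approach does not depend on how many spaces separate the two sentences, but it would fail if the English part contained non-ASCII characters.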

Related

PHP regex, how can I make my regex only return one group?

I have tried the non capturing group option ?:
Here is my data:
hello:"abcdefg"},"other stuff
Here is my regex:
/hello:"(.*?)"}/
Here is what it returns:
Array
(
[0] => Array
(
[0] => hello:"abcdefg"}
)
[1] => Array
(
[0] => abcdefg
)
)
I wonder, how can I make it so that [0] => abcdefg and [1] doesn't exist?
Is there any way to do this? I feel like it would be much cleaner and improve my performance. I understand that regex is simply doing what I told it to do, that is showing me the whole string that it found, and the group inside the string. But how can I make it only return abcdefg, and nothing more? Is this possible to do?
Thanks.
EDIT: I am using the regex on a website that says it uses perl regex. I am not actually using the perl interpreter
EDIT Again: apparently I misread the website. It is indeed using PHP, and it is calling it with this function: preg_match_all('/hello:"(.*?)"}/', 'hello:"abcdefg"},"other stuff', $arr, PREG_PATTERN_ORDER);
I apologize for this error, I fixed the tags.
EDIT Again 2: This is the website http://www.solmetra.com/scripts/regex/index.php
preg_match_all
If you want a different captured string, you need to change your regex. Here I'm looking for anything that is not a double quote ("), between two double quotes, where the opening quote follows a colon (:).
<?php
$string = 'hello:"abcdefg"},"other stuff';
$pattern = '!(?<=:")[^"]+(?=")!';
preg_match_all($pattern,$string,$matches);
echo $matches[0][0];
?>
Output
abcdefg
If you were to print_r($matches), you would see the outer array with the matches in their own nested array. To access the string you therefore need $matches[0][0], which supplies both keys. You will always have to deal with arrays when using preg_match_all.
Array
(
[0] => Array
(
[0] => abcdefg
)
)
preg_replace
Alternatively, if you were to use preg_replace instead, you could replace all of the contents of the string except for your capture group, and then you wouldn't need to deal with arrays (but you need to know a little more about regex).
<?php
$string = 'hello:"abcdefg"},"other stuff';
$pattern = '!^[^:]+:"([^"]+)".+$!s';
$new_string = preg_replace($pattern,"$1",$string);
echo $new_string;
?>
Output
abcdefg
preg_match_all is returning exactly what it is supposed to.
The first element holds the entire string that matched the regex. The remaining elements hold the capture groups.
If you just want the capture group, ignore the first element.
preg_match_all('/hello:"(.*?)"}/', 'hello:"abcdefg"},"other stuff', $arr, PREG_PATTERN_ORDER);
$firstMatch = $arr[1][0]; // $arr[1] is the array of group-1 captures
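If the goal is a result array with no separate group entry at all, one option is to require the prefix without reporting it. The following is a minimal sketch, not taken from the answers above, that assumes the value never contains a double quote; it uses \K to reset the start of the reported match and a lookahead for the closing "}:

```php
<?php
$string = 'hello:"abcdefg"},"other stuff';

// \K discards everything matched so far from the reported match,
// so hello:" is required but not returned.
// (?="}) asserts the closing "} without consuming it.
preg_match_all('/hello:"\K[^"]*(?="})/', $string, $matches);

print_r($matches);               // only index 0 exists, holding the bare values
$value = $matches[0][0] ?? null; // first value, or null if nothing matched
```

Because the pattern has no capture groups, $matches contains a single sub-array.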

How to separate words by uppercase letters?

I have string like:
SadnessSorrowSadnessSorrow
where words are concatenated without any space. Each word starts with a capital letter. I want to separate these words and select first 2 words to put in a new string.
I need to do this in a php application using preg_match function.
How should I go about it?
I tried using [A-Z], but somehow I am not getting it right.
Here, we can also split the string before each uppercase letter, for example:
$str = "SadnessSorrowSadnessSorrow";
$str_array = preg_split('/\B(?=[A-Z])/s', $str);
foreach ($str_array as $value) {
echo $value . "\n";
}
Based on bobble bubble's advice, it is much better to use \B(?=[A-Z]) instead of (?=[A-Z]), because the latter also matches at the very start of the string; alternatively, we might use PREG_SPLIT_NO_EMPTY.
Output
Sadness
Sorrow
Sadness
Sorrow
The answer came to me right after I posted the question:
preg_match_all('/[A-Z][a-z]+/', 'SadnessSorrowSadnessSorrow', $matches);
$matches[0] then contains:
(
[0] => Sadness
[1] => Sorrow
[2] => Sadness
[3] => Sorrow
)
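Neither snippet covers the second half of the question, putting the first two words into a new string. A minimal sketch of that step, assuming the words should be joined with a single space (the question does not say which separator is wanted):

```php
<?php
$str = 'SadnessSorrowSadnessSorrow';

// Split before every uppercase letter; PREG_SPLIT_NO_EMPTY drops any
// empty piece caused by the split point at the start of the string.
$words = preg_split('/(?=[A-Z])/', $str, -1, PREG_SPLIT_NO_EMPTY);

// Keep the first two words and join them (the space is an assumption).
$firstTwo = implode(' ', array_slice($words, 0, 2));

echo $firstTwo, "\n";
```

array_slice simply returns fewer elements if the input has fewer than two words, so no extra length check is needed.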

Split string depending on the existence of a leading character

In PHP, I need to split a string by ":" characters without a leading "*".
This is what using explode() does:
$string = "1*:2:3*:4";
explode(":", $string);
output: array("1*", "2", "3*", "4")
However the output I need is:
output: array("1*:2", "3*:4")
How would I achieve the desired output?
You're probably looking for preg_match_all() rather than explode(), as you are attempting a more complex split than explode() itself can handle. preg_match_all() will allow you to gather all of the parts of a string that match a specific pattern, expressed using a regular expression. The pattern you are looking for is something along the lines of:
anything except : followed by *: followed by anything but :
So, try this instead:
preg_match_all('/[^:]+\*:[^:]+/', $string, $matches);
print_r($matches);
Which will output something like:
Array
(
[0] => Array
(
[0] => 1*:2
[1] => 3*:4
)
)
You should be able to use $matches[0] in much the same way as the result of explode(), despite the added dimension in the array: preg_match_all groups results by capture group, and since this pattern has none, all results are in group 0, the whole-expression matches.
$str = '1*:2:3*:4';
$res = preg_split('~(?<!\*):~',$str);
print_r($res);
will output
Array
(
[0] => 1*:2
[1] => 3*:4
)
The pattern basically says:
split by [a colon that is not preceded by an asterisk]
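The two approaches agree on the sample string but differ once a segment has no asterisk before its colon. A small comparison sketch, using a hypothetical input that is not from the question:

```php
<?php
$input = '1*:2:3:4*:5'; // hypothetical: "3" has no starred partner

// Match-based: only keeps "x*:y" pairs, so the lone "3" is skipped.
preg_match_all('/[^:]+\*:[^:]+/', $input, $m);
$matched = $m[0];

// Split-based: cuts at every colon not preceded by "*", so "3" survives.
$split = preg_split('~(?<!\*):~', $input);

print_r($matched);
print_r($split);
```

Which behaviour is right depends on whether unstarred segments should appear in the result; the question's example does not settle that.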

Regex For Get Last URL

I have:
stackoverflow.com/.../link/Eee_666/9_uUU/66_99U
What regex would match /Eee_666/9_uUU/66_99U?
Eee_666, 9_uUU, and 66_99U are random values.
How can I solve it?
As simple as that:
$link = "stackoverflow.com/.../link/Eee_666/9_uUU/66_99U";
$regex = '~link/([^/]+)/([^/]+)/([^/]+)~';
# captures anything that is not a / in three different groups
preg_match_all($regex, $link, $matches);
print_r($matches);
Be aware, though, that [^/]+ eats up any character except the / (including newlines), so you either want to exclude other characters as well or feed the engine only strings in your format.
See a demo on regex101.com.
You can use \K here to make it more thorough.
stackoverflow\.com/.*?/link/\K([^/\s]+)/([^/\s]+)/([^/\s]+)
See demo.
https://regex101.com/r/jC8mZ4/2
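A minimal sketch of how the \K pattern might be used from PHP, assuming ~ as the delimiter so the slashes need no escaping (the variable names are only illustrative):

```php
<?php
$link  = 'stackoverflow.com/.../link/Eee_666/9_uUU/66_99U';
$regex = '~stackoverflow\.com/.*?/link/\K([^/\s]+)/([^/\s]+)/([^/\s]+)~';

if (preg_match($regex, $link, $m)) {
    // Because of \K, $m[0] starts after "link/" instead of at "stackoverflow".
    echo $m[0], "\n";
    // The three path segments are the capture groups 1 to 3.
    $segments = array_slice($m, 1);
    print_r($segments);
}
```

Excluding \s in the character classes keeps the groups from running past the end of the URL when it is embedded in longer text.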
In case you don't know the length of the string:
$string = 'stackoverflow.com/.../link/Eee_666/9_uUU/66_99U';
$regexp = '~([^/]+)$~';
preg_match($regexp, $string, $m);
Result:
$m[1] = 66_99U
Be careful: it may also capture the end-of-line character.
For this kind of requirement, it's simpler to use preg_split combined with array_slice:
$url = 'stackoverflow.com/.../link/Eee_666/9_uUU/66_99U';
$elem = array_slice(preg_split('~/~', $url), -3);
print_r($elem);
Output:
Array
(
[0] => Eee_666
[1] => 9_uUU
[2] => 66_99U
)
