Extract characters occurring before one of several forbidden characters - php

I want to discard all remaining characters in a string as soon as one of several unwanted characters is encountered.
As soon as a blacklisted character is encountered, the string before that point should be returned.
For instance, if I have an array:
$chars = array("a", "b", "c");
How would I go through the following string...
log dog hat bat
...and end up with:
log dog h

The strcspn function is what you are looking for.
<?php
$mask = "abc";
$string = "log dog hat bat";
$result = substr($string,0,strcspn($string,$mask));
var_dump($result);
?>
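Since the question starts from an array of characters rather than a mask string, here is a minimal sketch of wrapping the call in a helper. The name `beforeForbidden` is hypothetical, and the sketch assumes single-byte characters, because `strcspn` compares bytes.

```php
<?php
// Hypothetical helper: returns the part of $input before the first forbidden character.
// Assumes single-byte characters, because strcspn() compares bytes.
function beforeForbidden(string $input, array $forbidden): string
{
    // implode() without a glue argument joins ['a', 'b', 'c'] into "abc".
    $mask = implode($forbidden);

    // strcspn() returns the length of the leading segment that contains none of
    // the mask characters; if none occurs, that is the full length of $input.
    return substr($input, 0, strcspn($input, $mask));
}

var_dump(beforeForbidden('log dog hat bat', ['a', 'b', 'c'])); // string(9) "log dog h"
var_dump(beforeForbidden('log dog', ['a', 'b', 'c']));         // string(7) "log dog"
```

When no forbidden character is present, the whole string comes back unchanged, which is usually what you want here.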

There is certainly nothing wrong with Vinko's answer, and I would be more inclined to recommend that technique in a professional script because regex is likely to be slower. Purely as a point of difference for researchers, though, regex could be used.
For the record, to convert the array of ['a', 'b', 'c'] to abc, just call implode($array) -- an empty glue string is not necessary.
Code: (Demo) -- split in half on first occurrence of a|b|c, then access first element
echo preg_split('~[abc]~', $string, 2)[0];
Code: (Demo) -- match leading substring of non-a|b|c characters, then access first element
echo preg_match('~^[^abc]+~', $string, $match) ? $match[0] : '';
I should state that if any of your blacklisted characters have special meaning to the regex engine while inside of a character class, then they will need to be escaped.
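The escaping mentioned above can be handled with `preg_quote`. The following is only a sketch: the helper name `beforeForbiddenRegex` is hypothetical, and the empty-array guard is my own addition.

```php
<?php
// Hypothetical helper: regex variant that escapes the forbidden characters first.
function beforeForbiddenRegex(string $input, array $forbidden): string
{
    if ($forbidden === []) {
        return $input; // an empty character class "[]" would be an invalid pattern
    }

    // preg_quote() escapes regex metacharacters, including \ ^ - and ],
    // which are the ones that matter inside a character class.
    $class = preg_quote(implode($forbidden), '~');

    // Split once on the first forbidden character and keep the left-hand part.
    return preg_split('~[' . $class . ']~', $input, 2)[0];
}

var_dump(beforeForbiddenRegex('log dog hat bat', ['a', 'b', 'c'])); // string(9) "log dog h"
var_dump(beforeForbiddenRegex('x^2-y]z', ['-', ']']));              // string(3) "x^2"
```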

Related

Match all substrings that end with 4 digits using regular expressions

I am trying to split a string in php, which looks like this:
ABCDE1234ABCD1234ABCDEF1234
Into an array of string which, in this case, would look like this:
ABCDE1234
ABCD1234
ABCDEF1234
So the pattern is "an undefined number of letters, and then 4 digits, then an undefined number of letters and 4 digits etc."
I'm trying to split the string using preg_split like this:
$pattern = "#[0-9]{4}$#";
preg_split($pattern, $stringToSplit);
And it returns an array containing the full string (not split) in the first element.
I'm guessing the problem here is my regex as I don't fully understand how to use them, and I am not sure if I'm using it correctly.
So what would be the correct regex to use?
You don't want preg_split, you want preg_match_all:
$str = 'ABCDE1234ABCD1234ABCDEF1234';
preg_match_all('/[a-z]+[0-9]{4}/i', $str, $matches);
var_dump($matches);
Output:
array(1) {
[0]=>
array(3) {
[0]=>
string(9) "ABCDE1234"
[1]=>
string(8) "ABCD1234"
[2]=>
string(10) "ABCDEF1234"
}
}
PHP uses PCRE-style regexes which let you do lookbehinds. You can use this to see if there are 4 digits "behind" you. Combine that with a lookahead to see if there's a letter ahead of you, and you get this:
(?<=\d{4})(?=[a-z])
Notice the dotted lines on the Debuggex Demo page. Those are the points you want to split on.
In PHP this would be:
var_dump(preg_split('/(?<=\d{4})(?=[a-z])/i', 'ABCDE1234ABCD1234ABCDEF1234'));
Use the principle of contrast:
\D+\d{4}
# requires at least one non digit
# followed by exactly four digits
See a demo on regex101.com.
In PHP this would be:
<?php
$string = 'ABCDE1234ABCD1234ABCDEF1234';
$regex = '~\D+\d{4}~';
preg_match_all($regex, $string, $matches);
?>
See a demo on ideone.com.
I'm no good at regex so here is the road less traveled:
<?php
$s = 'ABCDE1234ABCD1234ABCDEF1234';
$nums = range(0,9);
$num_hit = 0;
$i = 0;
$arr = array();
foreach(str_split($s) as $v)
{
    // count the digits seen in the current chunk
    if(isset($nums[$v]))
    {
        ++$num_hit;
    }
    if(!isset($arr[$i]))
    {
        $arr[$i] = '';
    }
    $arr[$i].= $v;
    // after the fourth digit, start a new chunk
    if($num_hit === 4)
    {
        ++$i;
        $num_hit = 0;
    }
}
print_r($arr);
First, why is your attempted pattern not delivering the desired output? Because the $ anchor tells the function to explode the string by using only the final four digits as the "delimiter" (the characters that are consumed while dividing the string into separate parts).
Your result:
array (
0 => 'ABCDE1234ABCD1234ABCDEF', // an element of characters before the last four digits
1 => '', // an empty element containing the non-existent characters after the four digits
)
In plain English, to fix your pattern, you must:
Not consume any characters while exploding and
Ensure that no empty elements are generated.
My snippet is at the bottom of this post.
Second, there seems to be some debate about which regex function to use (or even whether regex is the preferable tool).
My stance is that a non-regex method requires a long-winded block of lines that is equally, if not more, difficult to read than a regex pattern. Regex lets you generate your result in one line, and not in an unsightly fashion. So let's dispose of iterated sets of conditions for this task.
Now the critical concern is whether this task is simply "extracting" data from a consistent and valid string (case "A"), or "validating AND extracting" data from a string (case "B") because the input cannot be 100% trusted to be consistent/correct.
In case A, you needn't concern yourself with producing valid elements in the output, so preg_split() or preg_match_all() are good candidates.
In case B, preg_split() would not be advisable, because it only hunts for delimiting substrings -- it remains ignorant of all other characters in the string.
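For case B, one possible approach is to validate the whole string first and only then extract. This is a sketch rather than part of my suggested technique below; the helper name `extractChunks` and the choice to return `null` on invalid input are assumptions made for illustration.

```php
<?php
// Hypothetical helper for "case B": returns the chunks, or null when the input is malformed.
function extractChunks(string $input): ?array
{
    // The whole string must consist of one or more "letters then 4 digits" groups.
    // \z anchors at the very end, so a trailing newline is rejected as well.
    if (!preg_match('~^(?:[a-z]+\d{4})+\z~i', $input)) {
        return null;
    }

    preg_match_all('~[a-z]+\d{4}~i', $input, $matches);

    return $matches[0]; // the fullstring matches
}

var_export(extractChunks('ABCDE1234ABCD1234ABCDEF1234'));
var_export(extractChunks('ABCDE123ABCD1234')); // NULL, the first group has only 3 digits
```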
Assuming this task is case A, then a decision is still pending about the better function to call. Well, both functions generate an array, but preg_match_all() creates a multidimensional array while you desire a flat array (like preg_split() provides). This means you would need to add a new variable to the global scope ($matches) and append [0] to the array to access the desired fullstring matches. To someone who doesn't understand regex patterns, this may border on the bad practice of using "magic numbers".
For me, I strive to code for Directness and Accuracy, then Efficiency, then Brevity and Clarity. Since you're not likely to notice any performance drop while performing such a small operation, efficiency isn't terribly important. I just want to make some comparisons to highlight the cost of a pattern that leverages only look-arounds, or of a pattern that misses an opportunity to greedily match predictable characters.
/(?<=\d{4})(?=[a-z])/i 79 steps (Demo)
~\d{4}\K~ 25 steps (Demo)
/[a-z]+[0-9]{4}\K/i 13 steps (Demo)
~\D+[0-9]{4}\K~ 13 steps (Demo)
~\D+\d{4}\K~ 13 steps (Demo)
FYI, \K is a metacharacter that means "restart the fullstring match", in other words "forget/release all previously matched characters up to this point". This effectively ensures that no characters are lost while splitting.
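To make the effect of \K concrete, here is a small side-by-side sketch using the same input. The variable names are only illustrative.

```php
<?php
$input = 'ABCDE1234ABCD1234ABCDEF1234';

// Without \K the four digits are the delimiter, so they are consumed and lost.
$consumed = preg_split('~\d{4}~', $input, 0, PREG_SPLIT_NO_EMPTY);

// With \K the reported match starts after the digits, so it is empty and nothing is lost.
$kept = preg_split('~\d{4}\K~', $input, 0, PREG_SPLIT_NO_EMPTY);

var_export($consumed); // 'ABCDE', 'ABCD', 'ABCDEF'
var_export($kept);     // 'ABCDE1234', 'ABCD1234', 'ABCDEF1234'
```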
Suggested technique: (Demo)
var_export(
preg_split(
'~\D+\d{4}\K~', // pattern
'ABCDE1234ABCD1234ABCDEF1234', // input
0, // make unlimited explosions
PREG_SPLIT_NO_EMPTY // exclude empty elements
)
);
Output:
array (
0 => 'ABCDE1234',
1 => 'ABCD1234',
2 => 'ABCDEF1234',
)

Separate hex blocks in PHP

Does anyone know a way to "separate" the blocks of this hex code?
[49cd0d18] -> 1238175000
[00010000] -> 1
[0069] -> 105
[543ace68] -> timestamp
000000000000000000000000000000000000000
Complete:
49cd0d1800010000543ace68000000000000000000000000000000000000000
Oh, of course... These values can be different. I just know that they will not stay the same. So I need to know how to "count" the blocks and then "cut" them.
I'll be very grateful for your help!
Regex is an easy solution for problems like this.
You can see the regex at this link: https://regex101.com/r/qP1bC7/1
Note: Don't forget to put delimiters (the slashes in my example) around your regex when you use it in your code:
/^(\w{8})(\w{8})(\w{4})(\w{8})(\w{39})$/
The caret and the dollar sign anchor the beginning and the end of the string, respectively.
The parentheses are capturing groups.
\w matches any letter (a-z in lower and upper case), any digit (0-9) and the underscore (_).
{8} means that it must match exactly 8 characters.
And you can see an example of the code here:
http://sandbox.onlinephpfunctions.com/code/c3a0ec3a45c53eb2c1b8e21cb978253ea4a28e52
The third parameter is an array that receives the matches of the regex (it is passed by reference, so preg_match fills it for you and you do not need to create it beforehand). Index 0 holds the whole match and indexes 1 to 5 hold the results of the five capturing groups.
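In case the sandbox link is unavailable, here is a rough sketch of what such a call can look like. The input is rebuilt from the blocks listed in the question, and the `hexdec` conversion is my own addition for illustration.

```php
<?php
// Input rebuilt from the blocks listed in the question; the last block is 39 zeros.
$string = '49cd0d18' . '00010000' . '0069' . '543ace68' . str_repeat('0', 39);

if (preg_match('/^(\w{8})(\w{8})(\w{4})(\w{8})(\w{39})$/', $string, $matches)) {
    // $matches[0] is the whole match; $matches[1] to $matches[5] are the groups.
    var_dump($matches[1]);         // string(8) "49cd0d18"
    var_dump(hexdec($matches[1])); // int(1238175000)
    var_dump(hexdec($matches[3])); // int(105)
}
```

If the input is guaranteed to be hexadecimal, replacing `\w` with `[0-9a-f]` would make the pattern stricter.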
You could also extract substring with PHP native functions.
$string = "49cd0d18000100000069543ace68000000000000000000000000000000000000000";
$matches = array();
$matches[] = substr($string, 0, 8);
$matches[] = substr($string, 8, 8);
$matches[] = substr($string, 16, 4);
$matches[] = substr($string, 20, 8);
$matches[] = substr($string, 28, 39);
var_dump($matches);
You can test the code here: http://sandbox.onlinephpfunctions.com/code/18935d55feb86dffdc17f8854572e0935b4aab0e
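If the block lengths may change later, the repeated `substr` calls could be driven by a list of lengths instead. This is a sketch with a hypothetical helper name, `cutBlocks`; it assumes you know the length of each block in advance.

```php
<?php
// Hypothetical helper: cuts $string into consecutive blocks of the given lengths.
function cutBlocks(string $string, array $lengths): array
{
    $blocks = [];
    $offset = 0;
    foreach ($lengths as $length) {
        $blocks[] = substr($string, $offset, $length);
        $offset += $length; // the next block starts where this one ended
    }

    return $blocks;
}

// Input rebuilt from the blocks listed in the question; the last block is 39 zeros.
$string = '49cd0d18' . '00010000' . '0069' . '543ace68' . str_repeat('0', 39);
$blocks = cutBlocks($string, [8, 8, 4, 8, 39]);

var_dump($blocks[2]); // string(4) "0069"
```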
Additional note: PHP native string functions are usually faster than regex. A regular expression has to be compiled before it can be used, although PHP keeps a cache of recently compiled patterns. You can benchmark both solutions if performance matters. Otherwise, I'd say the two solutions are pretty much equivalent.
Good success and don't forget to like,
Jonathan Parent-Lévesque from Montreal

PHP - know characters failed in a preg_match function

Is there a way to know which characters do not match in a preg_match call?
For example:
preg_match('/^[a-z]*$/i', 'Hello World!');
Is there some function to find the incorrect characters, in this case the space and "!"?
Thanks for your replies, but the problem with your examples is that they don't anchor the beginning and the end of the string. Your examples work when the string is contained in another one, not when the string must be exactly as I defined it in the pattern.
For example, suppose I have to validate the Italian fiscal code of a person, which is a string formatted like this:
XXX XXX YY X YY X YYY X (X = letter, Y = number - without spaces)
whose pattern is:
'/^[A-Z]{6}[0-9]{2}[A-Z]{1}[0-9]{2}[A-Z]{1}[0-9]{3}[A-Z]{1}$/i'
I must validate that the string matches exactly what I defined in the pattern.
If I use your code and get just one character wrong, the whole string is returned as an error.
http://eval.in/9178
The problem with the reverse pattern arises with a complex pattern that contains ANDs or ORs.
What I want to know is why preg_match fails, not only whether it fails.
Have you tried something like this?
$nonMatchingCharacters = preg_replace('/[a-z]/i', '', $wholeString);
That should strip out the 'legal' characters, leaving only the ones that you want to mention in your validation error message.
You could also do other treatments like...
$nonMatchingCharactersArray = array_unique(str_split($nonMatchingCharacters));
...if you want an array of unique, non-matching characters, and not just a string with bits stripped out of it.
This will give you the space and the !
preg_match_all('/[^a-z]/i', 'Hello World!', $matches);
var_dump($matches);
http://eval.in/9132
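If the position of each offending character is also useful for the error message, the `PREG_OFFSET_CAPTURE` flag can be added to the same call. A small sketch:

```php
<?php
// With PREG_OFFSET_CAPTURE every match becomes a [text, byte offset] pair.
preg_match_all('/[^a-z]/i', 'Hello World!', $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$char, $offset]) {
    printf("invalid character \"%s\" at offset %d\n", $char, $offset);
}
// invalid character " " at offset 5
// invalid character "!" at offset 11
```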
Just remove everything that matches with preg_replace, then split into an array what remains.
<?php
$str = preg_replace('/([0-9]{2}[a-z]*)/i', '', '03Hello 02World!');
$characters = str_split($str);
var_dump($characters);
http://eval.in/9152
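For a fixed-layout format such as the fiscal code in the question, a different technique from the answers above is to skip the single pattern and test each position against the class expected there. This is only a sketch: the helper name `fiscalCodeErrors` and the message wording are hypothetical, and the layout string is derived from the pattern in the question.

```php
<?php
// Hypothetical helper: lists the reasons why $code does not fit the layout.
function fiscalCodeErrors(string $code): array
{
    // L = letter, D = digit; 6 letters, 2 digits, 1 letter, 2 digits,
    // 1 letter, 3 digits, 1 letter, as in the pattern from the question.
    $layout = 'LLLLLLDDLDDLDDDL';
    $errors = [];

    if (strlen($code) !== strlen($layout)) {
        $errors[] = sprintf('expected %d characters, got %d', strlen($layout), strlen($code));
    }

    // Check only the positions that exist in both strings.
    $n = min(strlen($code), strlen($layout));
    for ($i = 0; $i < $n; $i++) {
        $wantLetter = $layout[$i] === 'L';
        $ok = $wantLetter ? ctype_alpha($code[$i]) : ctype_digit($code[$i]);
        if (!$ok) {
            $errors[] = sprintf(
                'position %d: "%s" should be a %s',
                $i,
                $code[$i],
                $wantLetter ? 'letter' : 'digit'
            );
        }
    }

    return $errors;
}

print_r(fiscalCodeErrors('RSSMRA8XT10A562S')); // one error, at position 7
```

An empty array means the string fits the layout. This does not generalize to patterns with alternation or variable-length parts.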

preg_replace or regex string translation

I found some partial help but cannot seem to fully accomplish what I need. I need to be able to do the following:
I need a regular expression that replaces any 1 to 3 character words between two words that are longer than 3 characters with a match-any expression:
For example:
walk to the beach ==> walk(.*)beach
If the 1 to 3 character word is not preceded by a word that's longer than 3 characters then I want to translate that 1 to 3 letter word to '<word> ?'
For example:
on the beach ==> on ?the ?beach
The simpler the rule the better (of course, if there's a more complicated alternative that performs better, I'll take that as well, since I anticipate heavy usage eventually).
This will be used in a PHP context most likely with preg_replace. Thus, if you can put it in that context then even better!
By the way so far I have got the following:
$string = preg_replace('/\s+/', '(.*)', $string);
$string = preg_replace('/\b(\w{1,3})(\.*)\b/', '${1} ?', $string);
but that results in:
walk to the beach ==> 'walk(.*)to ?beach'
which is not what I want. 'on the beach' seems to translate correctly.
I think you will need two replacements for that. Let's start with the first requirement:
$str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
Of course, you need to replace those \w (which match letters, digits and underscores) with a character class of what you actually want to treat as a word character.
The second one is a bit tougher, because matches cannot overlap and lookbehinds cannot be of variable length. So we have to run this multiple times in a loop:
do
{
$str = preg_replace('/^\w{0,3}(?: \w{0,3})* (?!\?)/', '$0?', $str, -1, $count);
} while($count);
Here we match everything from the beginning of the string, as long as it's only up-to-3-letter words separated by spaces, plus one trailing space (only if it is not already followed by a ?). Then we put all of that back in place, and append a ?.
Update:
After all the talk in the comments, here is an updated solution.
After running the first line, we can assume that the only words of up to 3 letters left will be at the beginning or at the end of the string. All others will have been collapsed to (.*). Since you want every space between those to be followed by a ?, you do not even need a loop (in fact these are the only spaces left):
$str = preg_replace('/ /', ' ?', $str);
(Do this right after my first line of code.)
This would give the following two results (in combination with the first line):
let us walk on the beach now go => let ?us ?walk(.*)beach ?now ?go
let us walk on the beach there now go => let ?us ?walk(.*)beach(.*)there ?now ?go
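Putting the two lines together, a minimal sketch of the updated solution as a single function could look like this. The function name `toLoosePattern` is hypothetical.

```php
<?php
// Hypothetical wrapper around the two replacements from the updated solution.
function toLoosePattern(string $phrase): string
{
    // Step 1: collapse short words between two long words into (.*).
    $phrase = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $phrase);

    // Step 2: every remaining space becomes an optional space.
    return preg_replace('/ /', ' ?', $phrase);
}

echo toLoosePattern('walk to the beach'), "\n"; // walk(.*)beach
echo toLoosePattern('on the beach'), "\n";      // on ?the ?beach
```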

Splitting string containing letters and numbers not separated by any particular delimiter in PHP

I am currently developing a web application that fetches a Twitter stream, and I am trying to build the natural language processing on my own.
Since my data is from Twitter (limited to 140 characters), many words are shortened or, as in this case, written without spaces.
For example:
"Hi, my name is Bob. I m 19yo and 170cm tall"
Should be tokenized to:
- hi
- my
- name
- bob
- i
- 19
- yo
- 170
- cm
- tall
Notice that 19 and yo in 19yo have no space between them. I use it mostly for extracting numbers with their units.
Simply put, what I need is a way to 'explode' each token that has a number in it into chunks of digits and chunks of letters, without a delimiter.
'123abc' will be ['123', 'abc']
'abc123' will be ['abc', '123']
'abc123xyz' will be ['abc', '123', 'xyz']
and so on.
What is the best way to achieve it in PHP?
I found something close to it, but it's C# and specifically for day/month splitting: How do I split a string in C# based on letters and numbers
You can use preg_split
$string = "Hi, my name is Bob. I m 19yo and 170cm tall";
$parts = preg_split("/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i", $string);
var_dump ($parts);
When matching against the digit-letter boundary, the regular expression match must be zero-width. The characters themselves must not be included in the match. For this the zero-width lookarounds are useful.
http://codepad.org/i4Y6r6VS
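An alternative to splitting on boundaries is to match the tokens themselves with `preg_match_all`, which also drops punctuation such as the period after "Bob". This is a different technique from the `preg_split` call above, sketched with a hypothetical helper name, `tokenize`, and assuming ASCII letters only.

```php
<?php
// Hypothetical helper: lowercases the text and returns runs of letters or runs of digits.
// Assumes ASCII input; accented letters would need a Unicode-aware pattern.
function tokenize(string $text): array
{
    preg_match_all('/[a-z]+|\d+/', strtolower($text), $matches);

    return $matches[0];
}

print_r(tokenize('abc123xyz')); // abc, 123, xyz
print_r(tokenize('Hi, my name is Bob. I m 19yo and 170cm tall'));
```

Note that words such as "is", "m" and "and" are still returned; filtering stop words would be a separate step.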
How about this:
You extract the numbers from the string using regexps and store them in an array, then replace the numbers in the string with some kind of special marker that 'holds' their position. After parsing the string, which now contains only your markers and normal characters, you put the numbers from the array back into their reserved places.
Just an idea, but IMHO it might work for you.
EDIT:
Try running this short code; hopefully you will see my point in the output. (This code doesn't work on codepad, I don't know why.)
<?php
$str = "Hi, my name is Bob. I m 19yo and 170cm tall";
preg_match_all("#\d+#", $str, $matches);
$str = preg_replace("!\d+!", "#SPEC#", $str);
print_r($matches[0]);
print $str;
