PHP regular expression, match the last occurrence - php

I have a php function that splits product names from their color name in woocommerce.
The full string is generally of this form "product name - product color", like for example:
"Boxer Welbar - light grey" splits into "Boxer Welbar" and "light grey"
"Longjohn Gari - marine stripe" splits into "Longjohn Gari" and "marine stripe"
But in some cases it can be "Tee-shirt - product color"...and in this case the split doesn't work as I want, because the "-" in Tee-shirt is detected.
How to circumvent this problem? Should I use a "lookahead" statement in the regexp?
function product_name_split($prod_name) {
$currenttitle = strip_tags($prod_name);
$splitted = preg_split("/–|[\p{Pd}\xAD]|(–)/", $currenttitle);
return $splitted;
}

I'd go for a negative lookahead.
Something like this:
-(?!.*-)
This searches for a - that is not followed by any other - later in the string, i.e. the last one.
It works as long as the color name never contains a -
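A minimal sketch of how that lookahead could be used with preg_split(). The helper name split_on_last_hyphen is made up for illustration, and the \s* on each side is an addition (not part of the pattern above) that trims the spaces around the delimiter. It only handles the plain ASCII hyphen:

```php
<?php
// Hypothetical helper (not from the question): split on the LAST hyphen only.
function split_on_last_hyphen($name)
{
    // -(?!.*-) matches a hyphen that has no other hyphen after it;
    // the surrounding \s* swallows the spaces next to the delimiter.
    return preg_split('/\s*-(?!.*-)\s*/', $name);
}

print_r(split_on_last_hyphen('Tee-shirt - light grey'));
print_r(split_on_last_hyphen('Boxer Welbar - light grey'));
```

As noted, a hyphen inside the color name would move the split point, so this is only safe when colors never contain one.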

What about counting space characters that surround a dash?
For example:
function product_name_split($prod_name) {
$currenttitle = strip_tags($prod_name);
$splitted = preg_split("/\s(–|[\p{Pd}\xAD]|(–))\s/", $currenttitle);
return $splitted;
}
This automatically trims spaces from split parts as well.

If you have " - " as the delimiter (note the spaces around the dash), you may simply use explode(...). If not, use
\s*-(?=[^-]+$)\s*
or
\w+-\w+(*SKIP)(*FAIL)|-
with preg_split(), see the demos on regex101.com (#2)
In PHP this could be:
<?php
$strings = ["Tee-shirt - product color", "Boxer Welbar - ligth grey", "Longjohn Gari - marine stripe"];
foreach ($strings as $string) {
print_r(explode(" - ", $string));
}
foreach ($strings as $string) {
print_r(preg_split("~\s*-(?=[^-]+$)\s*~", $string));
}
?>
Both approaches will yield
Array
(
[0] => Tee-shirt
[1] => product color
)
Array
(
[0] => Boxer Welbar
[1] => ligth grey
)
Array
(
[0] => Longjohn Gari
[1] => marine stripe
)
To collect the split items, use array_map(...):
$splitted = array_map( function($item) {return preg_split("~\s*-(?=[^-]+$)\s*~", $item); }, $strings);

Your sample inputs convey that the neighboring whitespace around the delimiting hyphen/dash is just as critical as the hyphen/dash itself.
I recommend doing all of the html and special entity decoding before executing your regex -- that's what these other functions are built for and it will make your regex pattern much simpler to read and maintain.
\p{Pd} will match any hyphen/dash. Reinforce the business logic in the code by declaring a maximum of 2 elements to be generated by the split.
As a general rule, I discourage declaring single-use variables.
Code: (Demo)
function product_name_split($prod_name) {
return preg_split(
"/ \p{Pd} /u",
strip_tags(
html_entity_decode(
$prod_name
)
),
2
);
}
$tests = [
'Tee-shirt - product color',
'Boxer Welbar - ligth grey',
'Longjohn Gari - marine stripe',
'En dash – green',
'Entity &ndash; blue',
];
foreach ($tests as $test) {
echo var_export(product_name_split($test), true) . "\n"; // true belongs to var_export(), not to product_name_split()
}
Output:
array (
0 => 'Tee-shirt',
1 => 'product color',
)
array (
0 => 'Boxer Welbar',
1 => 'ligth grey',
)
array (
0 => 'Longjohn Gari',
1 => 'marine stripe',
)
array (
0 => 'En dash',
1 => 'green',
)
array (
0 => 'Entity',
1 => 'blue',
)

As usual, there are several options for this. This is one of them:
explode — Split a string by a string
end — Set the internal pointer of an array to its last element
$currenttitle = 'Tee-shirt - product color';
$array = explode( '-', $currenttitle );
echo trim( end( $array ) ); // trim() removes the space left after the dash
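If both the product name and the color are needed, a regex-free sketch could look for the last " - " with strrpos() instead of explode()/end(). This swaps in a different technique from the snippet above; the helper name split_on_last_delimiter and its default delimiter are assumptions for illustration:

```php
<?php
// Hypothetical helper: split a title on the LAST occurrence of the delimiter.
function split_on_last_delimiter($title, $delimiter = ' - ')
{
    // strrpos() returns the position of the last occurrence, or false.
    $pos = strrpos($title, $delimiter);
    if ($pos === false) {
        return [$title]; // no delimiter found: return the title unchanged
    }
    return [
        substr($title, 0, $pos),
        substr($title, $pos + strlen($delimiter)),
    ];
}

print_r(split_on_last_delimiter('Tee-shirt - product color'));
```

Because the delimiter includes the surrounding spaces, the hyphen in "Tee-shirt" is never considered and no trimming is needed.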

Related

PHP - Extract Custom Operators from String

I have a MySQL database back-end which is searched via PHP. I'm trying to implement custom operators for detailed searching.
Here is an example:
$string = "o:draw t:spell card";
Where o: = Search by Description and t: = Card Typing. This would be similar to a custom syntax from a site like Scryfall https://scryfall.com/docs/syntax
Here is my work in progress example for o:
$string = "o:draw t:spell card";
if (strpos($string, 'o:') !== false) {
$text = explode('o:', $string, 2)[1];
echo $text;
}
The output is: draw t:spell card
I'm trying to get the output to respond with just: draw
However, it would also need to be able to accomplish this "smartly". It may not always be in a specific order and may show up alongside other operators.
Any suggestions on how to build around a system like this in PHP?
Example user searches:
Search Description + Type (new Syntax):
o:draw a card t:spell card
Search Card Name/Description (what it currently does):
Time Wizard
Search Card Name/Description (what it currently does):
Draw a Card
Combination of new and old syntax:
Time Wizard t:effect monster
I am unable to take your input all the way through to sql strings, because I don't know if you have more columns to access than the two that you have listed in your question.
To begin to script a parser for your users' input, you will need to understand all of the possible inputs that you wish to accommodate and assign specific handlers for each occurrence. This task has the ability to turn into a rabbit hole -- in other words, it may become an ever-extending and diverging task. My answer is only to get you started; I'm not in it for the long haul.
Declare a whitelist of commands that you intend to respect. In my snippet, this is $commands. The whitelist of strings with special meaning must not be confused with any card name strings -- I have added the best fringe case that I can find in the YuGiOh database (a card name containing a colon).
Build a regex pattern using the whitelist to find qualifying delimiting spaces in input strings.
Iterate the array generated by the regex-split input and assess each element to determine if/how it will be integrated into the WHERE clause of your sql.
At this early stage, it is fair to only allow one card name per input string, but certainly possible that users will wish to enter multiple card types to filter their expected result set.
I am electing to sanitize the user input by trimming leading and trailing whitespace and converting all letters to lowercase. I assume you will be running a modern database which will make case-insensitive comparisons.
Code: (Demo)
$tests = [
'o:draw a card t:spell card',
'Time Wizard',
'Draw a Card',
'Time Wizard t:effect monster',
'A-Team: Trap Disposal Unit',
't: effect o: draw a card atk: 1400 def: 600 l: 4 t: fairy',
];
$commands = [
'o' => '', // seems to be useless to me
't' => 'type', // card type
'atk' => 'attack', // card's attack value
'def' => 'defense', // card's defense value
'l' => 'level', // card's level value
];
$pattern = '~ *(?=(?:' . implode('|', array_keys($commands)) . '):)~'; // create regex pattern using commands lookup
foreach ($tests as $test) {
foreach (preg_split($pattern, strtolower(trim($test)), 0, PREG_SPLIT_NO_EMPTY) as $component) {
$colectomy = explode(':', $component, 2);
if (count($colectomy) < 2) {
if ($colectomy[0] !== 'draw a card') { // "draw a card" seems worthless to the query
$result[$test]['cardName (old syntax)'] = $component;
}
} elseif ($colectomy[0] !== 'o') { // o command seems worthless to the query
if (isset($commands[$colectomy[0]])) {
$result[$test][$commands[$colectomy[0]]][] = $colectomy[1]; // enable capturing of multiple values of same command
} else {
$result[$test]['cardName (new syntax)'] = $component;
}
}
}
}
var_export($result);
Output:
array (
'o:draw a card t:spell card' =>
array (
'type' =>
array (
0 => 'spell card',
),
),
'Time Wizard' =>
array (
'cardName (old syntax)' => 'time wizard',
),
'Time Wizard t:effect monster' =>
array (
'cardName (old syntax)' => 'time wizard',
'type' =>
array (
0 => 'effect monster',
),
),
'A-Team: Trap Disposal Unit' =>
array (
'cardName (new syntax)' => 'a-team: trap disposal unit',
),
't: effect o: draw a card atk: 1400 def: 600 l: 4 t: fairy' =>
array (
'type' =>
array (
0 => ' effect',
1 => ' fairy',
),
'attack' =>
array (
0 => ' 1400',
),
'defense' =>
array (
0 => ' 600',
),
'level' =>
array (
0 => ' 4',
),
),
)
You can do something like the code below to extract the key/value pairs from the string.
$tags = ['o:', 't:'];
$str = "t:spell card o:draw a card";
$dataList = [];
foreach($tags as $tag){
$tagDataLevel1 = explode($tag, $str, 2)[1];
$expNew = explode(':', $tagDataLevel1,2);
if(count($expNew)==2){
$tagData= strrev(explode(' ',strrev($expNew[0]),2)[1]);
}else{
$tagData = $tagDataLevel1;
}
$dataList[$tag] = $tagData;
//overwrite string
$str = str_replace($tag . $tagData,"",$str);
}
$oldKeyWord = $str;
var_dump ($dataList);
echo $oldKeyWord;

Capture multiple repetitive group in regex

I'm using /{(\w+)\s+((\w+="\w+")\s*)+/ pattern to capture all attributes.
The problem is that it matches the input but can't group attribute one by one and just groups the last attribute.
[person name="Jackson" family="Smith"]
or
[car brand="Benz" type="SUV"]
The \G (continue) metacharacter is the hero to call upon here.
Code: (PHP Demo) (Regex101 Demo)
$tag = '[person name="Jackson" family="Smith"]';
var_export(preg_match_all('~(?:\G|\[\w+) (\w+)="(\w+)"~', $tag, $out) ? array_combine($out[1], $out[2]) : []);
Output:
array (
'name' => 'Jackson',
'family' => 'Smith',
)
If you need to pool the attributes&values with the tag name, only one loop is necessary for this too.
Code: (Demo)
$text = 'some text [person name="Jackson" family="Smith"] text [vehicle brand="Benz" type="SUV" doors="4" seats="7"]';
foreach (preg_match_all('~(?:\G(?!^)|\[(\w+)) (\w+)="(\w+)"~', $text, $out, PREG_SET_ORDER) ? $out : [] as $matches) {
if ($matches[1]) {
$tag = $matches[1]; // cache the tag name for reuse with subsequent attr/val pairs
}
$result[$tag][$matches[2]] = $matches[3];
}
var_export($result);
Output:
array (
'person' =>
array (
'name' => 'Jackson',
'family' => 'Smith',
),
'vehicle' =>
array (
'brand' => 'Benz',
'type' => 'SUV',
'doors' => '4',
'seats' => '7',
),
)
Due to the concerns of @Thefourthbird and @Jan, I have included a lookahead to match the closing square brace. I have also built in accommodation for the possibility of zero attributes in the tag. If given more time (sorry, don't have more), I could probably refine the following snippet to be slightly cleaner, but I believe I am accurately validating and extracting.
Code: (Demo)
$text = 'some text [person name="Jackson" family="Smith"] text [vehicle brand="Benz" type="SUV" doors="4" seats="7"] and [invalid closed="false" monkeywrench [lonetag] text [single gender="female"]';
foreach (preg_match_all('~\[(\w+)(?=(?: \w+="\w+")*])(]?)|(?:\G(?!^) (\w+)="(\w+)")~', $text, $out, PREG_SET_ORDER) ? $out : [] as $matches) {
if ($matches[2]) {
$result[$matches[1]] = [];
} elseif (!isset($matches[3])) {
$tag = $matches[1];
} else {
$result[$tag][$matches[3]] = $matches[4];
}
}
var_export($result);
Output:
array (
'person' =>
array (
'name' => 'Jackson',
'family' => 'Smith',
),
'vehicle' =>
array (
'brand' => 'Benz',
'type' => 'SUV',
'doors' => '4',
'seats' => '7',
),
'lonetag' =>
array (
),
'single' =>
array (
'gender' => 'female',
),
)
You can try \[\S+ ((?:[^"]+"){2}) ((?:[^"]+"){2})\]
Explanation:
\[ - match [ literally
\S+ - match one or more non-whitespace characters
(?:...) - non-capturing group
[^"]+" - match one or more characters other than ", followed by a "; {2} repeats this pattern twice
\] - match ] literally
The first capturing group will hold your first attribute and the second capturing group will hold the second attribute.
Demo
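A short sketch of that pattern applied with preg_match(), using the sample tag from the question. It assumes the tag has exactly two attributes, because that is what the two {2} groups require:

```php
<?php
$tag = '[person name="Jackson" family="Smith"]';
$pattern = '/\[\S+ ((?:[^"]+"){2}) ((?:[^"]+"){2})\]/';

if (preg_match($pattern, $tag, $m)) {
    echo $m[1], "\n"; // first attribute, including its quotes
    echo $m[2], "\n"; // second attribute, including its quotes
}
```

A tag with one or three attributes will not match this pattern at all, which is the main limitation compared with the \G approach.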
Better use two expressions (or a parser altogether) instead. Consider the following:
<?php
$junk = <<<END
lorem ipsum lorem ipsum
[person name="Jackson" family="Smith"]
lorem ipsum
[car brand="Benz" type="SUV"]
lorem ipsum lorem ipsum
END;
$tag = "~\[(?P<tag>\w+)[^][]*\]~";
$key_values = '~(?P<key>\w+)="(?P<value>[^"]*)"~';
preg_match_all($tag, $junk, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
echo "Name: {$match["tag"]}\n";
preg_match_all($key_values, $match[0], $attributes, PREG_SET_ORDER);
print_r($attributes);
}
?>
Here we have
\[(?P<tag>\w+)[^][]*\]
for likely tags and
(?P<key>\w+)="(?P<value>[^"]*)"
for key/value pairs. The rest is a foreach loop.

PHP - preg_replace on an array not working as expected

I have a PHP array that looks like this..
Array
(
[0] => post: 746
[1] => post: 2
[2] => post: 84
)
I am trying to remove the post: from each item in the array and return one that looks like this...
Array
(
[0] => 746
[1] => 2
[2] => 84
)
I have attempted to use preg_replace like this...
$array = preg_replace('/^post: *([0-9]+)/', $array );
print_r($array);
But this is not working for me, how should I be doing this?
You've missed the second argument of preg_replace, which is the replacement for the match. Your regex also has a small problem; here is the fixed version:
preg_replace('/^post:\s*([0-9]+)$/', '$1', $array );
Demo: https://3v4l.org/64fO6
You haven't passed a replacement argument, not even an empty string as a placeholder.
mixed preg_replace ( mixed $pattern , mixed $replacement , mixed $subject)
is the signature you are calling (there are other args, but they are optional).
$array = preg_replace('/post: /', '', $array );
Should do it.
<?php
$array=array("post: 746",
"post: 2",
"post: 84");
$array = preg_replace('/^post: /', '', $array );
print_r($array);
?>
Array
(
[0] => 746
[1] => 2
[2] => 84
)
You could do this without using a regex using array_map and substr to check the prefix and return the string without the prefix:
$items = [
"post: 674",
"post: 2",
"post: 84",
];
$result = array_map(function($x){
$prefix = "post: ";
if (substr($x, 0, strlen($prefix)) == $prefix) {
return substr($x, strlen($prefix));
}
return $x;
}, $items);
print_r($result);
Result:
Array
(
[0] => 674
[1] => 2
[2] => 84
)
There are many ways to do this that don't involve regular expressions, which are really not needed for breaking up a simple string like this.
For example:
<?php
$input = Array( 'post: 746', 'post: 2', 'post: 84');
$output = array_map(function ($n) {
$o = explode(': ', $n);
return (int)$o[1];
}, $input);
var_dump($output);
And here's another one that is probably even faster:
<?php
$input = Array( 'post: 746', 'post: 2', 'post: 84');
$output = array_map(function ($n) {
return (int)substr($n, strpos($n, ':')+1);
}, $input);
var_dump($output);
If you don't need integers in the output just remove the cast to int.
Or just use str_replace, which in many cases like this is a drop in replacement for preg_replace.
<?php
$input = Array( 'post: 746', 'post: 2', 'post: 84');
$output = str_replace('post: ', '', $input);
var_dump($output);
You can use array_map() to iterate the array then strip out any non-digital characters via filter_var() with FILTER_SANITIZE_NUMBER_INT or trim() with a "character mask" containing the six undesired characters.
You can also let preg_replace() do the iterating for you. Using preg_replace() offers the most brief syntax, but regular expressions are often slower than non-preg_ techniques and it may be overkill for your seemingly simple task.
Codes: (Demo)
$array = ["post: 746", "post: 2", "post: 84"];
// remove all non-integer characters
var_export(array_map(function($v){return filter_var($v, FILTER_SANITIZE_NUMBER_INT);}, $array));
// only necessary if you have elements with non-"post: " AND non-integer substrings
var_export(preg_replace('~^post: ~', '', $array));
// I shuffled the character mask to prove order doesn't matter
var_export(array_map(function($v){return trim($v, ': opst');}, $array));
Output: (from each technique is the same)
array (
0 => '746',
1 => '2',
2 => '84',
)
p.s. If anyone is going to entertain the idea of using explode() to create an array of each element then store the second element of the array as the new desired string (and I wouldn't go to such trouble) be sure to:
split on ": " (colon, space) or even "post: " (post, colon, space) because splitting on ":" (colon only) forces you to tidy up the second element's leading space and
use explode()'s 3rd parameter (limit) and set it to 2 because logically, you don't need more than two elements
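For completeness, the two points above could be sketched like this. The fallback to the original string when the delimiter is missing is an assumption, not something the question asked for:

```php
<?php
$array = ["post: 746", "post: 2", "post: 84"];

$result = array_map(function ($v) {
    // Split on colon + space, producing at most 2 elements.
    $parts = explode(': ', $v, 2);
    // Assumed fallback: keep the original value if there was no delimiter.
    return $parts[1] ?? $v;
}, $array);

var_export($result);
```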

Extracting javascript object from html using regex & php

I am trying to extract a specific JavaScript object from a page containing the usual HTML markup.
I have tried to use regex but I don't seem to be able to get it to parse the HTML correctly when the HTML contains a line break.
An example can be seen here: https://regex101.com/r/b8zN8u/2
The HTML I am trying to extract looks like this:
<script>
DATA.tracking.user = {
age: "19",
name: "John doe"
}
</script>
Using the following regex: DATA.tracking.user=(.*?)}
<?php
$re = '/DATA.tracking.user = (.*?)\}/m';
$str = '<script>
DATA.tracking.user = { age: "19", name: "John doe" }
</script>';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
If I parse DATA.tracking.user = { age: "19", name: "John doe" } without any line breaks, it works fine, but if I try to parse:
DATA.tracking.user = {
age: "19",
name: "John doe"
}
It does not like dealing with the line breaks.
Any help would be greatly appreciated.
Thanks.
You will need to specify whitespaces (\s) in your pattern in order to parse the javascript code containing linebreaks.
For example, if you use the following code:
<?php
$re = '/DATA.tracking.user = \{\s*.*\s*.*\s*\}/';
$str = '<script>
DATA.tracking.user = {
age: "19",
name: "John doe"
}
</script>';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r($matches[0]);
?>
You will get the following output:
Array
(
[0] => DATA.tracking.user = {
age: "19",
name: "John doe"
}
)
The simple solution to your problem is to use the s pattern modifier to command the . (any character) to also match newline characters -- which it does not by default.
And you should:
escape your literal dots.
write the \{ outside of your capture group.
omit the m pattern modifier because you aren't using anchors.
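Put together, those adjustments might look like the following sketch, which reuses the sample markup from the question; only the pattern changes:

```php
<?php
$str = '<script>
DATA.tracking.user = {
age: "19",
name: "John doe"
}
</script>';

// Escaped dots, \{ outside the capture group, s modifier instead of m.
if (preg_match('/DATA\.tracking\.user = \{(.*?)\}/s', $str, $m)) {
    echo trim($m[1]); // the object body without the braces
}
```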
...BUT...
If this was my task and I was going to be processing the data from the extracted string, I would probably start breaking up the components at extraction-time with the power of \G.
Code: (Demo) (Pattern Demo)
$htmls[] = <<<HTML
DATA.tracking.user = { age: "19", name: "John doe", int: 55 } // This works
HTML;
$htmls[] = <<<HTML
DATA.tracking.user = {
age: "20",
name: "Jane Doe",
int: 49
} // This does not works
HTML;
foreach ($htmls as $html) {
var_export(preg_match_all('~(?:\G(?!^),|DATA\.tracking\.user = \{)\s+([^:]+): (\d+|"[^"]*")~', $html, $out, PREG_SET_ORDER) ? $out : []);
echo "\n --- \n";
}
Output:
array (
0 =>
array (
0 => 'DATA.tracking.user = { age: "19"',
1 => 'age',
2 => '"19"',
),
1 =>
array (
0 => ', name: "John doe"',
1 => 'name',
2 => '"John doe"',
),
2 =>
array (
0 => ', int: 55',
1 => 'int',
2 => '55',
),
)
---
array (
0 =>
array (
0 => 'DATA.tracking.user = {
age: "20"',
1 => 'age',
2 => '"20"',
),
1 =>
array (
0 => ',
name: "Jane Doe"',
1 => 'name',
2 => '"Jane Doe"',
),
2 =>
array (
0 => ',
int: 49',
1 => 'int',
2 => '49',
),
)
---
Now you can simply iterate the matches and work with [1] (the keys) and [2] (the values). This is a basic solution, that can be further tailored to suit your project data. Admittedly, this doesn't account for values that contain an escaped double-quote. Adding this feature would be no trouble. Accounting for more complex value types may be more of a challenge.
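As a rough sketch of that iteration, the matches could be folded into an associative array. The variable name $user and the stripping of the surrounding double quotes are illustrative choices, not part of the pattern itself:

```php
<?php
$html = 'DATA.tracking.user = { age: "19", name: "John doe", int: 55 }';
$pattern = '~(?:\G(?!^),|DATA\.tracking\.user = \{)\s+([^:]+): (\d+|"[^"]*")~';

$user = [];
if (preg_match_all($pattern, $html, $out, PREG_SET_ORDER)) {
    foreach ($out as $match) {
        // [1] is the key, [2] the raw value; drop the enclosing quotes if any.
        $user[$match[1]] = trim($match[2], '"');
    }
}

var_export($user);
```

Note that every value ends up as a string here, including the unquoted integer.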
You need to add the 's' modifier to the end of your regex - otherwise, "." does not include newlines. See this:
s (PCRE_DOTALL)
If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
So basically change your regex to be:
'/DATA.tracking.user = (.*?)\}/ms'
Also, you should escape your other dots (otherwise you will also match something like "DATAYtrackingZuser"). So...
'/DATA\.tracking\.user = (.*?)\}/ms'
I'd also add in the open curly bracket and not enforce the single space around the equal sign, so:
'/DATA\.tracking\.user\s*=\s*\{(.*?)\}/ms'
Since you seem to be scraping/reading the page anyway (so you have a local copy), you can simply replace all the newline characters in the HTML page with space characters; then it should work without changing your regex.
Refer to this for the ascii values:
https://www.techonthenet.com/ascii/chart.php
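A small sketch of that idea, assuming the page is already in a string. The newline variants are replaced with str_replace() and the pattern from the question is left untouched:

```php
<?php
$str = "<script>\nDATA.tracking.user = {\nage: \"19\",\nname: \"John doe\"\n}\n</script>";

// Replace every newline variant with a single space.
$flat = str_replace(["\r\n", "\r", "\n"], ' ', $str);

// The question's original pattern, unchanged.
preg_match_all('/DATA.tracking.user = (.*?)\}/m', $flat, $matches, PREG_SET_ORDER, 0);
print_r($matches);
```

The capture still starts with the opening brace, because the original pattern keeps it inside the group.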

Convert Regexp in Js into PHP?

I have the following regular expression in javascript and I would like to have the exact same functionality (or similar) in php:
// -=> REGEXP - match "x bed" , "x or y bed":
var subject = query;
var myregexp1 = /(\d+) bed|(\d+) or (\d+) bed/img;
var match = myregexp1.exec(subject);
while (match != null){
if (match[1]) { "X => " + match[1]; }
else{ "X => " + match[2] + " AND Y => " + match[3]}
match = myregexp1.exec(subject);
}
This code searches a string for a pattern matching "x beds" or "x or y beds".
When a match is located, variable x and variable y are required for further processing.
QUESTION:
How do you construct this code snippet in php?
Any assistance appreciated guys...
You can use the regex unchanged. PCRE supports everything this JavaScript pattern uses, except the /g flag, which doesn't exist in PHP. Instead you have preg_match_all, which returns an array of results:
preg_match_all('/(\d+) bed|(\d+) or (\d+) bed/im', $subject, $matches,
PREG_SET_ORDER);
foreach ($matches as $match) {
// $match[1] is X for "x bed"; $match[2] and $match[3] are X and Y for "x or y bed"
}
PREG_SET_ORDER is the other trick here, and will keep the $match array similar to how you'd get it in Javascript.
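A fuller sketch of the port, mirroring the branches of the JavaScript loop. The function name describe_beds and the sample sentence are made up for illustration. Where JavaScript leaves match[1] undefined for the "x or y bed" case, PHP gives an empty string, so the check is against '':

```php
<?php
// Hypothetical helper: returns one formatted line per match.
function describe_beds($subject)
{
    $lines = [];
    preg_match_all('/(\d+) bed|(\d+) or (\d+) bed/im', $subject, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        if ($match[1] !== '') {
            // "x bed" alternative
            $lines[] = "X => " . $match[1];
        } else {
            // "x or y bed" alternative
            $lines[] = "X => " . $match[2] . " AND Y => " . $match[3];
        }
    }
    return $lines;
}

print_r(describe_beds('5 bed flat or 8 or 9 bed house'));
```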
I've found RosettaCode to be useful when answering these kinds of questions.
It shows how to do the same thing in various languages. Regex is just one example; they also have file io, sorting, all kinds of basic stuff.
You can use preg_match_all( $pattern, $subject, &$matches, $flags, $offset ), to run a regular expression over a string and then store all the matches to an array.
After running the regexp, all the matches can be found in the array you passed as third argument. You can then iterate through these matches using foreach.
Without setting $flags, your array will have a structure like this:
$array[0] => array ( // An array of all strings that matched (e.g. "5 bed" or "8 or 9 bed")
0 => "5 bed",
1 => "8 or 9 bed"
);
$array[1] => array ( // First group, one entry per match ("" where the group did not take part)
0 => "5",
1 => ""
);
$array[2] => array ( // Second group
0 => "",
1 => "8"
);
$array[3] => array ( // Third group
0 => "",
1 => "9"
);
This behaviour isn't exactly the same, and I personally don't like it that much. To change the behaviour to a more "JavaScript-like"-one, set $flags to PREG_SET_ORDER. Your array will now have the same structure as in JavaScript.
$array[0] => array(
0 => "5 bed", // the full match
1 => "5", // the first value between brackets
);
$array[1] => array(
0 => "8 or 9 bed",
1 => "", // group 1 did not take part in this match
2 => "8",
3 => "9"
);
