I have a CSV file with one of the fields holding state/country info, formatted like:
"Florida United States" or "Alberta Canada" or "Wellington New Zealand" - not comma or tab delimited between them, simply space delimited.
I have an array of all the potential countries as well.
What I am looking for is a way, inside a loop, to split the state and country into separate variables by matching the country against the $countryarray that I have. Something like:
$countryarray=array("United States","Canada","New Zealand");
$userfield="Wellington New Zealand";
// wanted: match "New Zealand", extract it into $country and put the rest into $state
Split won't do it straight up - because many of the countries AND states have spaces, but the original data set concatenated the state and country together with just a space...
TIA!
I'm a fan of the RegEx method that @Mike Morton mentioned. You can take an array of countries, implode them using |, which is the RegEx OR operator, and use that as an "ends with one of these" pattern.
Below I've come up with two ways to do this, a simple way and an arguably overly complicated way that does some extra escaping. To illustrate what that escaping does I've added a fake country called Country XYZ (formally ABC).
Here's the sample data that works with both methods, as well as a helper function that actually does the matching and echoing. The RegEx does named-capturing, too, which makes things really easy to deal with.
// Sample data
$data = [
'Wellington New Zealand',
'Florida United States of America',
'Quebec Canada',
'Something Country XYZ (formally ABC)',
];
// Array of all possible countries
$countries = [
'United States of America',
'Canada',
'New Zealand',
'Country XYZ (formally ABC)',
];
// The beginning and ending pattern delimiter for the RegEx
$delim = '/';
function matchAndShowData(array $data, array $countries, string $delim, string $countryParts): void
{
$pattern = "^(?<region>.*?) (?<country>$countryParts)$";
foreach($data as $d) {
if(preg_match($delim . $pattern . $delim, $d, $matches)){
echo sprintf('%1$s, %2$s', $matches['region'], $matches['country']), PHP_EOL;
} else {
echo 'NO MATCH: ' . $d, PHP_EOL;
}
}
}
Option 1
The first option is a naïve implode. This method, however, will not find the country that includes parentheses.
matchAndShowData($data, $countries, $delim, implode('|', $countries));
Output
Wellington, New Zealand
Florida, United States of America
Quebec, Canada
NO MATCH: Something Country XYZ (formally ABC)
Option 2
The second option applies proper RegEx quoting of the countries, just in case they have special characters. If you are 100% certain you don't have any, this is overkill, but I personally have learned, after way too many hours of debugging, to just always quote, just in case.
$patternParts = array_map(fn(string $country) => preg_quote($country, $delim), $countries);
// Implode the cleaned countries using the RegEx pipe operator which means "OR"
matchAndShowData($data, $countries, $delim, implode('|', $patternParts));
Output
Wellington, New Zealand
Florida, United States of America
Quebec, Canada
Something, Country XYZ (formally ABC)
Note
If you don't expect your list of countries to change often, you can echo the pattern out once and bake it into your code. That will probably shave a couple of milliseconds off the execution time, which might be worth it in a tight loop.
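A middle ground between rebuilding the pattern on every call and hard-coding it is to build the quoted pattern once and reuse it for each row. The following is only a sketch of that idea; the helper names buildCountryPattern() and splitRegionCountry() are made up for illustration and are not part of the code above.

```php
<?php
// Hypothetical helper: quote each country and build the full pattern once.
function buildCountryPattern(array $countries, string $delim = '/'): string
{
    $quoted = array_map(
        fn(string $country): string => preg_quote($country, $delim),
        $countries
    );
    return $delim . '^(?<region>.*?) (?<country>' . implode('|', $quoted) . ')$' . $delim;
}

// Hypothetical helper: returns the two named parts, or null when no country matches.
function splitRegionCountry(string $value, string $pattern): ?array
{
    if (preg_match($pattern, $value, $matches) !== 1) {
        return null;
    }
    return ['region' => $matches['region'], 'country' => $matches['country']];
}

$pattern = buildCountryPattern([
    'United States of America',
    'Canada',
    'New Zealand',
    'Country XYZ (formally ABC)',
]);

// Reuse the same pattern for every row of the CSV.
foreach (['Wellington New Zealand', 'British Columbia Canada', 'Atlantis'] as $row) {
    var_export(splitRegionCountry($row, $pattern));
    echo PHP_EOL;
}
```

Returning null for a non-match keeps the "NO MATCH" decision with the caller instead of echoing inside the helper.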
Demo
You can see a demo of this here: https://3v4l.org/CaNRZ
Prepare the array of countries for use in a regular expression with preg_quote().
Build a regex pattern that will match a space followed by one of the country values then the end of the string. A lookahead ((?= ... )) is used to ensure that those matched characters are not consumed/destroyed while exploding.
Save the 2-element returned array from preg_split() to the output array.
Code: (Demo)
$branches = array_map(fn($country) => preg_quote($country, '/'), $countries);
$result = [];
foreach ($data as $string) {
$result[] = preg_split('/ (?=(?:' . implode('|', $branches) . ')$)/', $string);
}
var_export($result);
Output:
array (
0 =>
array (
0 => 'Wellington',
1 => 'New Zealand',
),
1 =>
array (
0 => 'Florida',
1 => 'United States of America',
),
2 =>
array (
0 => 'Quebec',
1 => 'Canada',
),
3 =>
array (
0 => 'Something',
1 => 'Country XYZ (formally ABC)',
),
)
Note that if an item/row in the result array only has one element, then you know that the attempted split failed to match the country substring.
I use this same technique when splitting a street name from a street type, for cases like "First Street North" (a multi-word street type).
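The "only one element means no match" rule can be wrapped in a small function. This is a sketch under a few assumptions: the name splitStateCountry() is hypothetical, and a limit of 2 is passed to preg_split() (not in the code above) so that a country list containing both "Zealand" and "New Zealand" could not produce three pieces.

```php
<?php
// Hypothetical helper: returns [state, country], or null when the string
// does not end with a space followed by one of the known countries.
function splitStateCountry(string $string, array $countries): ?array
{
    $branches = array_map(fn($country) => preg_quote($country, '/'), $countries);
    $pattern = '/ (?=(?:' . implode('|', $branches) . ')$)/';

    // Limit 2: split at the first qualifying space only.
    $parts = preg_split($pattern, $string, 2);

    return is_array($parts) && count($parts) === 2 ? $parts : null;
}

$countries = ['Canada', 'New Zealand', 'United States of America'];

foreach (['Quebec Canada', 'Wellington New Zealand', 'Atlantis'] as $string) {
    $parts = splitStateCountry($string, $countries);
    echo $parts === null ? "NO MATCH: $string" : implode(' | ', $parts), PHP_EOL;
}
```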
Related
I have a php function that splits product names from their color name in woocommerce.
The full string is generally of this form "product name - product color", like for example:
"Boxer Welbar - ligth grey" splits into "Boxer Welbar" and "light grey"
"Longjohn Gari - marine stripe" splits into "Longjohn Gari" and "marine stripe"
But in some cases it can be "Tee-shirt - product color"...and in this case the split doesn't work as I want, because the "-" in Tee-shirt is detected.
How to circumvent this problem? Should I use a "lookahead" statement in the regexp?
function product_name_split($prod_name) {
$currenttitle = strip_tags($prod_name);
$splitted = preg_split("/–|[\p{Pd}\xAD]|(–)/", $currenttitle);
return $splitted;
}
I'd go for a negative lookahead.
Something like this:
-(?!.*-)
This matches a - that is not followed by any other -, in other words the last - in the string.
It only works if the color name itself never contains a -.
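To illustrate both the idea and its limitation, here is a small sketch. The function name splitOnLastHyphen() is hypothetical, the optional \s* on each side is an addition so the pieces come back trimmed, and only the plain ASCII hyphen is handled.

```php
<?php
// Hypothetical helper: split on the last "-" in the string.
function splitOnLastHyphen(string $title): array
{
    // "-" not followed by any later "-"; \s* swallows surrounding spaces.
    return preg_split('/\s*-(?!.*-)\s*/', $title);
}

// Works: the hyphen inside "Tee-shirt" is followed by another "-", so it is skipped.
print_r(splitOnLastHyphen('Tee-shirt - product color'));

// Limitation: a hyphen in the color name becomes the last "-" and is split on instead.
print_r(splitOnLastHyphen('Tee-shirt - blue-green'));
```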
What about counting space characters that surround a dash?
For example:
function product_name_split($prod_name) {
$currenttitle = strip_tags($prod_name);
$splitted = preg_split("/\s(–|[\p{Pd}\xAD]|(–))\s/", $currenttitle);
return $splitted;
}
This automatically trims spaces from split parts as well.
If your delimiter is always " - " (note the spaces around the dash), you may simply use explode(...). If not, use
\s*-(?=[^-]+$)\s*
or
\w+-\w+(*SKIP)(*FAIL)|-
with preg_split(), see the demos on regex101.com (#2)
In PHP this could be:
<?php
$strings = ["Tee-shirt - product color", "Boxer Welbar - ligth grey", "Longjohn Gari - marine stripe"];
foreach ($strings as $string) {
print_r(explode(" - ", $string));
}
foreach ($strings as $string) {
print_r(preg_split("~\s*-(?=[^-]+$)\s*~", $string));
}
?>
Both approaches will yield
Array
(
[0] => Tee-shirt
[1] => product color
)
Array
(
[0] => Boxer Welbar
[1] => ligth grey
)
Array
(
[0] => Longjohn Gari
[1] => marine stripe
)
To collect the split items, use array_map(...):
$splitted = array_map( function($item) {return preg_split("~\s*-(?=[^-]+$)\s*~", $item); }, $strings);
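The (*SKIP)(*FAIL) alternative mentioned above is not exercised in the PHP snippet, so here is a sketch of it. One change is an assumption on my part: the second branch is written as \s*-\s* rather than a bare - so that the surrounding spaces are consumed by the split.

```php
<?php
$strings = [
    'Tee-shirt - product color',
    'Boxer Welbar - ligth grey',
    'Longjohn Gari - marine stripe',
];

// First branch: match a hyphenated word such as "Tee-shirt", then discard it
// with (*SKIP)(*FAIL) so its hyphen can never be used as a split point.
// Second branch: any remaining hyphen (plus surrounding whitespace) is the delimiter.
$pattern = '~\w+-\w+(*SKIP)(*FAIL)|\s*-\s*~';

$split = array_map(
    fn(string $string): array => preg_split($pattern, $string),
    $strings
);

print_r($split);
```

Unlike the lookahead version, this one protects any word-hyphen-word sequence, so a hyphenated color name would survive as well.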
Your sample inputs convey that the neighboring whitespace around the delimiting hyphen/dash is just as critical as the hyphen/dash itself.
I recommend doing all of the html and special entity decoding before executing your regex -- that's what these other functions are built for and it will make your regex pattern much simpler to read and maintain.
\p{Pd} will match any hyphen/dash. Reinforce the business logic in the code by declaring a maximum of 2 elements to be generated by the split.
As a general rule, I discourage declaring single-use variables.
Code: (Demo)
function product_name_split($prod_name) {
return preg_split(
"/ \p{Pd} /u",
strip_tags(
html_entity_decode(
$prod_name
)
),
2
);
}
$tests = [
'Tee-shirt - product color',
'Boxer Welbar - ligth grey',
'Longjohn Gari - marine stripe',
'En dash – green',
'Entity &ndash; blue',
];
foreach ($tests as $test) {
var_export(product_name_split($test)); // prints the array directly
echo "\n";
}
Output:
array (
0 => 'Tee-shirt',
1 => 'product color',
)
array (
0 => 'Boxer Welbar',
1 => 'ligth grey',
)
array (
0 => 'Longjohn Gari',
1 => 'marine stripe',
)
array (
0 => 'En dash',
1 => 'green',
)
array (
0 => 'Entity',
1 => 'blue',
)
As usual, there are several options for this; this is one of them:
explode — Split a string by a string
end — Set the internal pointer of an array to its last element
$currenttitle = 'Tee-shirt - product color';
$array = explode( ' - ', $currenttitle ); // split on the spaced dash so the hyphen in "Tee-shirt" is left alone
echo end( $array );
I have a MySQL database back-end which is searched via PHP. I'm trying to implement custom operators for detailed searching.
Here is an example:
$string = "o:draw t:spell card";
Where o: = Search by Description and t: = Card Typing. This would be similar to a custom syntax from a site like Scryfall https://scryfall.com/docs/syntax
Here is my work in progress example for o:
$string = "o:draw t:spell card";
if (strpos($string, 'o:') !== false) {
$text = explode('o:', $string, 2)[1];
echo $text;
}
The output is: draw t:spell card
I'm trying to get the output to respond with just: draw
However, it would also need to be able to accomplish this "smartly". It may not always be in a specific order and may show up alongside other operators.
Any suggestions on how to build around a system like this in PHP?
Example user searches:
Search Description + Type (new Syntax):
o:draw a card t:spell card
Search Card Name/Description (what it currently does):
Time Wizard
Search Card Name/Description (what it currently does):
Draw a Card
Combination of new and old syntax:
Time Wizard t:effect monster
I am unable to take your input all the way through to sql strings, because I don't know if you have more columns to access than the two that you have listed in your question.
To begin to script a parser for your users' input, you will need to understand all of the possible inputs that you wish to accommodate and assign specific handlers for each occurrence. This task has the ability to turn into a rabbit hole -- in other words, it may become an ever-extending and diverging task. My answer is only to get you started; I'm not in it for the long haul.
Declare a whitelist of commands that you intend to respect. In my snippet, this is $commands. The whitelist of strings with special meaning must not be confused with any card name strings -- I have added the best fringe case that I can find in the YuGiOh database (a card name containing a colon).
Build a regex pattern using the whitelist to find qualifying delimiting spaces in input strings.
Iterate the array generated by the regex-split input and assess each element to determine if/how it will be integrated into the WHERE clause of your sql.
At this early stage, it is fair to only allow one card name per input string, but certainly possible that users will wish to enter multiple card types to filter their expected result set.
I am electing to sanitize the user input by trimming leading and trailing whitespace and converting all letters to lowercase. I assume you will be running a modern database which will make case-insensitive comparisons.
Code: (Demo)
$tests = [
'o:draw a card t:spell card',
'Time Wizard',
'Draw a Card',
'Time Wizard t:effect monster',
'A-Team: Trap Disposal Unit',
't: effect o: draw a card atk: 1400 def: 600 l: 4 t: fairy',
];
$commands = [
'o' => '', // seems to be useless to me
't' => 'type', // card type
'atk' => 'attack', // card's attack value
'def' => 'defense', // card's defense value
'l' => 'level', // card's level value
];
$pattern = '~ *(?=(?:' . implode('|', array_keys($commands)) . '):)~'; // create regex pattern using commands lookup
$result = []; // collect the parsed components per input string
foreach ($tests as $test) {
foreach (preg_split($pattern, strtolower(trim($test)), 0, PREG_SPLIT_NO_EMPTY) as $component) {
$colectomy = explode(':', $component, 2);
if (count($colectomy) < 2) {
if ($colectomy[0] !== 'draw a card') { // "draw a card" seems worthless to the query
$result[$test]['cardName (old syntax)'] = $component;
}
} elseif ($colectomy[0] !== 'o') { // o command seems worthless to the query
if (isset($commands[$colectomy[0]])) {
$result[$test][$commands[$colectomy[0]]][] = $colectomy[1]; // enable capturing of multiple values of same command
} else {
$result[$test]['cardName (new syntax)'] = $component;
}
}
}
}
var_export($result);
Output:
array (
'o:draw a card t:spell card' =>
array (
'type' =>
array (
0 => 'spell card',
),
),
'Time Wizard' =>
array (
'cardName (old syntax)' => 'time wizard',
),
'Time Wizard t:effect monster' =>
array (
'cardName (old syntax)' => 'time wizard',
'type' =>
array (
0 => 'effect monster',
),
),
'A-Team: Trap Disposal Unit' =>
array (
'cardName (new syntax)' => 'a-team: trap disposal unit',
),
't: effect o: draw a card atk: 1400 def: 600 l: 4 t: fairy' =>
array (
'type' =>
array (
0 => ' effect',
1 => ' fairy',
),
'attack' =>
array (
0 => ' 1400',
),
'defense' =>
array (
0 => ' 600',
),
'level' =>
array (
0 => ' 4',
),
),
)
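As a rough idea of how the parsed components might then feed a WHERE clause, here is a sketch. Everything schema-related is an assumption: the column names card_name and card_type are invented, buildWhere() is a hypothetical helper, and LIKE matching with ? placeholders is just one plausible choice for a prepared statement.

```php
<?php
// Hypothetical helper: turn one parsed entry into [whereSql, params].
// $columns maps parser keys to (assumed) database column names.
function buildWhere(array $filters, array $columns): array
{
    $clauses = [];
    $params = [];
    foreach ($filters as $key => $values) {
        if (!isset($columns[$key])) {
            continue; // ignore anything that is not whitelisted
        }
        $ors = [];
        foreach ((array) $values as $value) {
            $ors[] = $columns[$key] . ' LIKE ?';
            $params[] = '%' . trim($value) . '%'; // trim the leading space left by "t: effect"
        }
        $clauses[] = '(' . implode(' OR ', $ors) . ')';
    }
    return [implode(' AND ', $clauses), $params];
}

// Assumed column names, for illustration only.
$columns = [
    'cardName (old syntax)' => 'card_name',
    'type' => 'card_type',
];

[$where, $params] = buildWhere(
    ['cardName (old syntax)' => 'time wizard', 'type' => [' effect', ' fairy']],
    $columns
);

echo $where, PHP_EOL;
var_export($params);
```

The values only ever travel as bound parameters, never as part of the SQL string. Note that % and _ typed by the user would still act as LIKE wildcards unless they are escaped separately.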
You can do something like the code below to extract the key values from the string.
$tags = ['o:', 't:'];
$str = "t:spell card o:draw a card";
$dataList = [];
foreach($tags as $tag){
$tagDataLevel1 = explode($tag, $str, 2)[1];
$expNew = explode(':', $tagDataLevel1,2);
if(count($expNew)==2){
$tagData= strrev(explode(' ',strrev($expNew[0]),2)[1]);
}else{
$tagData = $tagDataLevel1;
}
$dataList[$tag] = $tagData;
//overwrite string
$str = str_replace($tag . $tagData,"",$str);
}
$oldKeyWord = $str;
var_dump ($dataList);
echo $oldKeyWord;
Having some trouble normalizing some strings in PHP...
Given these test cases:
Van Fleur, Pat
Smith,John K
Smith, John Jr.
Smith,Jose Jr
I am attempting to normalize names in a list that use the format: Lastname,Firstname
Expected output for the test cases:
Van Fleur,Pat
Smith,John
Smith,John
Smith,Jose
I am using the following line, but it appears to handle only a subset of these test cases.
Using this: strtok(trim(strtolower($name)), ' ')
I'm not great at regex, so really haven't ventured down that road yet.
Can you assist me with achieving the desired output using either regex or native functions?
No way around that, you need to somehow iterate over that data array and convert each entry:
<?php
$data = [
'Van Fleur, Pat',
'Smith,John K',
'Smith, John Jr.',
'Smith,Jose Jr'
];
array_walk($data, function($value, $key) use (&$data) {
preg_match('|\s*(\w.+),\s*(\w+)|', $value, $token);
$data[$key] = sprintf('%s,%s', $token[1], $token[2]);
});
print_r($data);
The output obviously is:
Array
(
[0] => Van Fleur,Pat
[1] => Smith,John
[2] => Smith,John
[3] => Smith,Jose
)
An obvious alternative is something like this:
<?php
$input = [
'Van Fleur, Pat',
'Smith,John K',
'Smith, John Jr.',
'Smith,Jose Jr'
];
$output = array_map(function($value) {
preg_match('|\s*(\w.+),\s*(\w+)|', $value, $token);
return sprintf('%s,%s', $token[1], $token[2]);
}, $input);
print_r($output);
But be careful here: such an approach won't scale well, since you effectively double the memory footprint of the data that way.
So this next alternative may be the more elegant one, since, just like the first example, it changes the entries in place:
<?php
$data = [
'Van Fleur, Pat',
'Smith,John K',
'Smith, John Jr.',
'Smith,Jose Jr'
];
foreach($data as &$entry) {
preg_match('|\s*(\w.+),\s*(\w+)|', $entry, $token);
$entry = sprintf('%s,%s', $token[1], $token[2]);
}
print_r($data);
Considering your comment below which describes a slightly different scenario I would add this suggestion:
$entry = preg_replace('|^\s*(\w.+),\s*(\w+)\s*.*$|', '$1,$2', $entry);
Capture the leading substring until the ,, then match (but don't capture) the comma and optional space, then greedily capture non-space characters, then just match the rest of the string so that the replacement value overwrites the full original value.
Using negated character classes speeds up the pattern. Here is a simple one-call method:
Pattern Demo
Code: (Demo)
$names=[
'Van Fleur, Pat',
'Smith,John K',
'Smith, John Jr.',
'Smith,Jose Jr'
];
$names=preg_replace('/([^,]+), ?([^ ]+).*/','$1,$2',$names);
var_export($names);
Output:
array (
0 => 'Van Fleur,Pat',
1 => 'Smith,John',
2 => 'Smith,John',
3 => 'Smith,Jose',
)
Let's consider some more complex hypothetical inputs -- including names that don't need correcting.
Van Fleur, Pat // <-- 1 replacement
Smith,Josiah // <-- nothing to fix
Smith,John K // <-- 1 replacement
Smith,John Jacob Jingleheimer // <-- 1 long replacement
O'Shannahan-O'Neil, Sean Patrick Eamon // <-- double surname with apostrophes
de la Cruz, Bethania // <-- 3-word surname
Smith, John Jr. // <-- 2 replacements
Smith,Jose Jr // <-- 1 replacement
You can use my first posted pattern which is an efficient pattern, but it will execute replacements on names that do not require any fixes.
Alternatively, you can use this "capture-less" pattern: /,\K | [^,]*$/ with an empty replacement string. This will use many more steps, but will avoid performing needless replacing.
Code: (Demo)
$names=preg_replace('/,\K | [^,]*$/','',$names);
var_export($names);
Output:
array (
0 => 'Van Fleur,Pat',
1 => 'Smith,Josiah',
2 => 'Smith,John',
3 => 'Smith,John',
4 => 'O\'Shannahan-O\'Neil,Sean',
5 => 'de la Cruz,Bethania',
6 => 'Smith,John',
7 => 'Smith,Jose',
)
Lastly, if you have some deep-seated hate for regex (I certainly don't), you can use this method:
foreach($names as &$name){
$parts=explode(',',$name);
$name=$parts[0].','.explode(' ',ltrim($parts[1]),2)[0];
}
unset($name); // this is not required, but many recommend it to prevent issues downscript
var_export($names);
The decision about which one is best for your project, will come down to the quality of your real data and your personal tastes. I suggest running some comparative speed tests if optimization is a priority.
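One thing the explode-based loop assumes is that every entry contains a comma. If the real data may include entries without one, a guarded variant could look like the sketch below; the function name normalizeName() is hypothetical and the choice to return comma-less entries unchanged (apart from trimming) is an assumption.

```php
<?php
// Hypothetical helper: "Lastname, Firstname Extra" -> "Lastname,Firstname".
function normalizeName(string $name): string
{
    $parts = explode(',', $name, 2);
    if (count($parts) < 2) {
        return trim($name); // no comma: nothing to normalize
    }
    // Keep only the first word after the comma.
    $first = explode(' ', ltrim($parts[1]), 2)[0];
    return trim($parts[0]) . ',' . $first;
}

$names = ['Van Fleur, Pat', 'Smith,John K', 'Smith, John Jr.', 'Smith,Jose Jr', 'Cher'];
var_export(array_map('normalizeName', $names));
```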
Try this:
^([^\,]+)\,\s?([^\s]+)
I have an input string like this:
"Day":June 8-10-2012,"Location":US,"City":Newyork
I need to match 3 value substrings:
June 8-10-2012
US
Newyork
I don't need the labels.
Per my comment above, if this is JSON, you should definitely use those functions as they are more suited for this.
However, you can use the following REGEX.
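/:([a-zA-Z0-9\s-]*)/
<?php
// PHP has no "g" flag; preg_match() stops at the first match,
// so preg_match_all() is needed to collect all three values.
preg_match_all('/:([a-zA-Z0-9\s-]*)/', '"Day":June 8-10-2012,"Location":US,"City":Newyork', $matches);
print_r($matches[1]); // the captured values without the leading colon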
The regex demo is here:
https://regex101.com/r/BbwVQ5/1
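For comparison, this is roughly what the JSON route would look like. It is only a sketch and rests on an assumption: the string shown in the question is not valid JSON as written (no braces, unquoted values), so the example uses a well-formed version of the same data.

```php
<?php
// Assumed well-formed JSON version of the question's data.
$json = '{"Day":"June 8-10-2012","Location":"US","City":"Newyork"}';

// Decode to an associative array; json_decode() returns null on invalid input.
$decoded = json_decode($json, true);

// Drop the labels and keep only the values.
$values = is_array($decoded) ? array_values($decoded) : [];

var_export($values);
```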
Here are a couple of simple ways:
Code: (Demo)
$string = '"Day":June 8-10-2012,"Location":US,"City":Newyork';
var_export(preg_match_all('/:\K[^,]+/', $string, $out) ? $out[0] : 'fail');
echo "\n\n";
var_export(preg_split('/,?"[^"]+":/', $string, 0, PREG_SPLIT_NO_EMPTY));
Output:
array (
0 => 'June 8-10-2012',
1 => 'US',
2 => 'Newyork',
)
array (
0 => 'June 8-10-2012',
1 => 'US',
2 => 'Newyork',
)
Pattern #1 Demo: \K restarts the match after the : so that a positive lookbehind can be avoided (saving "steps" / improving pattern efficiency). By matching all following characters that are not a comma, a capture group can be avoided as well.
Pattern #2 Demo: ,? makes the comma optional, and "[^"]+" matches the leading double-quoted "key". The substring to split on therefore covers the full "key" and ends on the following : colon.
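To make the \K explanation concrete, the sketch below runs the lookbehind and capture-group equivalents next to it. The two extra patterns are written here for comparison and are not from the answer; all three should yield the same values.

```php
<?php
$string = '"Day":June 8-10-2012,"Location":US,"City":Newyork';

// \K: match the colon, then forget it, so the full match is just the value.
preg_match_all('/:\K[^,]+/', $string, $withK);

// Lookbehind: assert that a colon precedes the value.
preg_match_all('/(?<=:)[^,]+/', $string, $withLookbehind);

// Capture group: the value ends up in group 1 instead of the full match.
preg_match_all('/:([^,]+)/', $string, $withGroup);

var_export($withK[0]);
echo PHP_EOL;
var_export($withK[0] === $withLookbehind[0] && $withK[0] === $withGroup[1]);
```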
This is a php example, but an algorithm for any language would do. What I specifically want to do is bubble up the United States and Canada to the top of the list. Here is an example of the array shortened for brevity.
array(
0 => '-- SELECT --',
1 => 'Afghanistan',
2 => 'Albania',
3 => 'Algeria',
4 => 'American Samoa',
5 => 'Andorra',)
The ids need to stay intact. So making them -1 or -2 will unfortunately not work.
What I usually do in these situations is to add a separate field called DisplayOrder or something similar. Everything defaults to, say, 1... You then sort by DisplayOrder and then the Name. If you want something higher or lower on the list, you can tweak the display order accordingly while keeping your normal IDs as-is.
-- Kevin Fairchild
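The same display-order idea can be applied to a plain PHP array when no database column is available. This is a sketch only: the $displayOrder map and its values are invented for illustration, and uksort() is used because it keeps the original ids as keys.

```php
<?php
$countries = [
    0 => '-- SELECT --',
    1 => 'Afghanistan',
    2 => 'Albania',
    22 => 'Canada',
    44 => 'United States',
];

// Hypothetical display order by id; anything not listed gets the default.
$displayOrder = [0 => 0, 44 => 1, 22 => 2];
$default = 3;

// Sort by display order first, then by name. Keys (ids) are preserved.
uksort($countries, function (int $a, int $b) use ($countries, $displayOrder, $default): int {
    return [$displayOrder[$a] ?? $default, $countries[$a]]
        <=> [$displayOrder[$b] ?? $default, $countries[$b]];
});

print_r($countries);
```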
My shortcut in similar cases is to add a space at the start of Canada and two spaces at the start of United States. If displaying these as options in a SELECT tag, the spaces are not visible but the sorting still brings them to the front.
However, that may be a little hacky in some contexts. In Java the thing to do would be to implement a Comparator, writing the compare() method so that the US and Canada are special cases, then sort the list (or array) passing in your new comparator.
However I would imagine it might be simpler to just find the relevant entries in the array, remove them from the array and add them again at the start. If you are in some kind of framework which will re-sort the array, then it might not work. But in most cases that will do just fine.
[edit] I see that you are using a hash and not an array - so it will depend on how you are doing the sorting. Could you simply put the US into the hash with a key of -2, Canada with -1 and then sort by ID instead? Not having used PHP in anger for 11 years, I don't recall whether it has built-in sorting in its hashes or if you are doing that at the application level.
$a = array(
0 => '- select -',
1 => 'Afghanistan',
2 => 'Albania',
3 => 'Algeria',
80 => 'USA'
);
$temp = array();
foreach ($a as $k => $v) {
$v == 'USA'
? array_unshift($temp, array($k, $v))
: array_push($temp, array($k, $v));
}
foreach ($temp as $t) {
list ($k, $v) = $t;
echo "$k => $v\n";
}
The output is then:
80 => USA
0 => - select -
1 => Afghanistan
2 => Albania
3 => Algeria
You cannot change the order of elements within the same array by "moving" an item around. What you can do is build a new array that first has your favourite items and then adds everything else from the original countries array at the end:
$countries = array(
0 => '-- SELECT --',
1 => 'Afghanistan',
2 => 'Albania',
3 => 'Algeria',
4 => 'American Samoa',
5 => 'Andorra',
22 => 'Canada',
44 => 'United States',);
# tell what should be upfront (by id)
$favourites = array(0, 44, 22);
# add favourites at first
$ordered = array();
foreach($favourites as $id)
{
$ordered[$id] = $countries[$id];
}
# add everything else
$ordered += array_diff_assoc($countries, $ordered);
# result
print_r($ordered);
Demo
It's been ages since I last wrote any code, but yes:
array_unshift($queue, "United States", "Canada");
print_r($queue);
array_unshift — Prepend one or more elements to the beginning of an array
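One caveat for this particular question: array_unshift() renumbers all numeric keys from zero, so the ids would not stay intact. The sketch below shows that effect next to a key-preserving alternative using the array union operator; the sample ids 22 and 44 are assumptions taken from the example data earlier in this thread.

```php
<?php
$countries = [
    0 => '-- SELECT --',
    1 => 'Afghanistan',
    22 => 'Canada',
    44 => 'United States',
];

// array_unshift() prepends, but every numeric key is renumbered from 0
// and the two countries now appear twice.
$unshifted = $countries;
array_unshift($unshifted, 'United States', 'Canada');
print_r($unshifted);

// Union keeps the ids: keys already present on the left are not overwritten.
$top = [44 => 'United States', 22 => 'Canada'];
$ordered = $top + $countries;
print_r($ordered);
```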