Having some trouble normalizing some strings in PHP...
Given these test cases:
Van Fleur, Pat
Smith,John K
Smith, John Jr.
Smith,Jose Jr
I am attempting to normalize names in a list that use the format: Lastname,Firstname
Expected output for the test cases:
Van Fleur,Pat
Smith,John
Smith,John
Smith,Jose
I am using the following line, but it appears to handle only a subset of these test cases.
Using this: strtok(trim(strtolower($name)), ' ')
I'm not great at regex, so really haven't ventured down that road yet.
Can you assist me with achieving the desired output using either regex or native functions?
There is no way around it: you need to iterate over the data array and convert each entry:
<?php
$data = [
'Van Fleur, Pat',
'Smith,John K',
'Smith, John Jr.',
'Smith,Jose Jr'
];
array_walk($data, function (&$value) {
    // Take the value by reference so the entry is rewritten in place
    preg_match('|\s*(\w.+),\s*(\w+)|', $value, $token);
    $value = sprintf('%s,%s', $token[1], $token[2]);
});
print_r($data);
The output obviously is:
Array
(
[0] => Van Fleur,Pat
[1] => Smith,John
[2] => Smith,John
[3] => Smith,Jose
)
An obvious alternative is something like this:
<?php
$input = [
'Van Fleur, Pat',
'Smith,John K',
'Smith, John Jr.',
'Smith,Jose Jr'
];
$output = array_map(function($value) {
preg_match('|\s*(\w.+),\s*(\w+)|', $value, $token);
return sprintf('%s,%s', $token[1], $token[2]);
}, $input);
print_r($output);
But be careful here: such an approach won't scale well, since it keeps a second copy of the data and so roughly doubles the memory footprint.
So this alternative may be more elegant, since, like the first example, it changes the entries in place:
<?php
$data = [
'Van Fleur, Pat',
'Smith,John K',
'Smith, John Jr.',
'Smith,Jose Jr'
];
foreach($data as &$entry) {
preg_match('|\s*(\w.+),\s*(\w+)|', $entry, $token);
$entry = sprintf('%s,%s', $token[1], $token[2]);
}
unset($entry); // break the reference so later code cannot overwrite the last element by accident
print_r($data);
Considering your comment below which describes a slightly different scenario I would add this suggestion:
$entry = preg_replace('|^\s*(\w.+),\s*(\w+)\s*.*$|', '$1,$2', $entry);
Capture the leading substring until the ,, then match (but don't capture) the comma and optional space, then greedily capture non-space characters, then just match the rest of the string so that the replacement value overwrites the full original value.
Using negated character classes speeds up the pattern. Here is a simple one-call method:
Code: (Demo)
$names=[
'Van Fleur, Pat',
'Smith,John K',
'Smith, John Jr.',
'Smith,Jose Jr'
];
$names=preg_replace('/([^,]+), ?([^ ]+).*/','$1,$2',$names);
var_export($names);
Output:
array (
0 => 'Van Fleur,Pat',
1 => 'Smith,John',
2 => 'Smith,John',
3 => 'Smith,Jose',
)
Let's consider some more complex hypothetical inputs -- including names that don't need correcting.
Van Fleur, Pat // <-- 1 replacement
Smith,Josiah // <-- nothing to fix
Smith,John K // <-- 1 replacement
Smith,John Jacob Jingleheimer // <-- 1 long replacement
O'Shannahan-O'Neil, Sean Patrick Eamon // <-- double surname with apostrophes
de la Cruz, Bethania // <-- 3-word surname
Smith, John Jr. // <-- 2 replacements
Smith,Jose Jr // <-- 1 replacement
You can use my first pattern, which is efficient, but it will perform replacements even on names that do not need any fixing.
Alternatively, you can use this "capture-less" pattern: /,\K | [^,]*$/ with an empty replacement string. This will use many more steps, but will avoid performing needless replacing.
Code: (Demo)
$names=preg_replace('/,\K | [^,]*$/','',$names);
var_export($names);
Output:
array (
0 => 'Van Fleur,Pat',
1 => 'Smith,Josiah',
2 => 'Smith,John',
3 => 'Smith,John',
4 => 'O\'Shannahan-O\'Neil,Sean',
5 => 'de la Cruz,Bethania',
6 => 'Smith,John',
7 => 'Smith,Jose',
)
Lastly, if you have some deep-seated hate for regex (I certainly don't), you can use this method:
foreach($names as &$name){
$parts=explode(',',$name);
$name=$parts[0].','.explode(' ',ltrim($parts[1]),2)[0];
}
unset($name); // this is not required, but many recommend it to prevent issues downscript
var_export($names);
The decision about which one is best for your project will come down to the quality of your real data and your personal tastes. I suggest running some comparative speed tests if optimization is a priority.
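A comparative speed test along those lines could be sketched as follows. This is only a rough harness, not a rigorous benchmark: the sample data, the iteration count and the labels are assumptions made for illustration, and it relies on hrtime() and arrow functions, so PHP 7.4 or later is assumed.

```php
<?php
// Illustrative sample; real data will behave differently.
$names = [
    'Van Fleur, Pat',
    'Smith,Josiah',
    'Smith,John K',
    'Smith, John Jr.',
];

// The three approaches from this answer, wrapped so they share one signature.
$candidates = [
    'capture groups' => fn(array $in) => preg_replace('/([^,]+), ?([^ ]+).*/', '$1,$2', $in),
    'capture-less'   => fn(array $in) => preg_replace('/,\K | [^,]*$/', '', $in),
    'explode'        => function (array $in) {
        foreach ($in as &$name) {
            $parts = explode(',', $name);
            $name = $parts[0] . ',' . explode(' ', ltrim($parts[1]), 2)[0];
        }
        unset($name);
        return $in;
    },
];

$results = [];
foreach ($candidates as $label => $normalize) {
    $start = hrtime(true); // nanoseconds
    for ($i = 0; $i < 10000; $i++) {
        $results[$label] = $normalize($names);
    }
    printf("%-15s %.2f ms\n", $label, (hrtime(true) - $start) / 1e6);
}
```

The timings depend heavily on the PHP version and on how many names actually need fixing, so it is worth running the harness against a sample of the real list.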
Try this:
^([^\,]+)\,\s?([^\s]+)
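A minimal sketch of how that pattern might be applied with preg_match(). The helper name normalizeName() is hypothetical, and returning unmatched input unchanged is an assumption about the desired behavior.

```php
<?php
// normalizeName() is a hypothetical helper, not part of the original answer.
function normalizeName(string $name): string
{
    // Group 1: everything before the comma; group 2: the first run of
    // non-whitespace characters after the comma and an optional space.
    if (preg_match('/^([^\,]+)\,\s?([^\s]+)/', $name, $m)) {
        return $m[1] . ',' . $m[2];
    }
    return $name; // assumption: leave input without a comma untouched
}

$normalized = array_map('normalizeName', [
    'Van Fleur, Pat',
    'Smith,John K',
    'Smith, John Jr.',
    'Smith,Jose Jr',
]);
print_r($normalized);
```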
I have a CSV file with one of the fields holding state/country info, formatted like:
"Florida United States" or "Alberta Canada" or "Wellington New Zealand" - not comma or tab delimited between them, simply space delimited.
I have an array of all the potential countries as well.
What I am looking for is a solution that, in a loop, splits the state and country into separate variables by matching the country against the $countryarray that I have. Something like:
$countryarray=array("United States","Canada","New Zealand");
$userfield="Wellington New Zealand";
// pseudocode: match "New Zealand", extract it into $country, and put the rest into $state
Split won't do it straight up - because many of the countries AND states have spaces, but the original data set concatenated the state and country together with just a space...
TIA!
I'm a fan of the RegEx method that @Mike Morton mentioned. You can take an array of countries, implode them using the | which is a RegEx OR, and use that as an "ends with one of these" pattern.
Below I've come up with two ways to do this, a simple way and an arguably overly complicated way that does some extra escaping. To illustrate what that escaping does I've added a fake country called Country XYZ (formally ABC).
Here's the sample data that works with both methods, as well as a helper function that actually does the matching and echoing. The RegEx does named-capturing, too, which makes things really easy to deal with.
// Sample data
$data = [
'Wellington New Zealand',
'Florida United States of America',
'Quebec Canada',
'Something Country XYZ (formally ABC)',
];
// Array of all possible countries
$countries = [
'United States of America',
'Canada',
'New Zealand',
'Country XYZ (formally ABC)',
];
// The beginning and ending pattern delimiter for the RegEx
$delim = '/';
function matchAndShowData(array $data, array $countries, string $delim, string $countryParts): void
{
$pattern = "^(?<region>.*?) (?<country>$countryParts)$";
foreach($data as $d) {
if(preg_match($delim . $pattern . $delim, $d, $matches)){
echo sprintf('%1$s, %2$s', $matches['region'], $matches['country']), PHP_EOL;
} else {
echo 'NO MATCH: ' . $d, PHP_EOL;
}
}
}
Option 1
The first option is a naïve implode. This method, however, will not find the country that includes parentheses.
matchAndShowData($data, $countries, $delim, implode('|', $countries));
Output
Wellington, New Zealand
Florida, United States of America
Quebec, Canada
NO MATCH: Something Country XYZ (formally ABC)
Option 2
The second option applies proper RegEx quoting of the countries, just in case they have special characters. If you are 100% certain you don't have any, this is overkill, but I personally have learned, after way too many hours of debugging, to just always quote, just in case.
$patternParts = array_map(fn(string $country) => preg_quote($country, $delim), $countries);
// Implode the cleaned countries using the RegEx pipe operator which means "OR"
matchAndShowData($data, $countries, $delim, implode('|', $patternParts));
Output
Wellington, New Zealand
Florida, United States of America
Quebec, Canada
Something, Country XYZ (formally ABC)
Note
If you don't expect your list of countries to change often, you can echo the pattern out once and bake it into your code. That will probably shave a little off the execution time, which might be worth it in a tight loop.
Demo
You can see a demo of this here: https://3v4l.org/CaNRZ
Prepare the array of countries for use in a regular expression with preg_quote().
Build a regex pattern that will match a space followed by one of the country values then the end of the string. A lookahead ((?= ... )) is used to ensure that those matched characters are not consumed/destroyed while exploding.
Save the 2-element returned array from preg_split() to the output array.
Code: (Demo)
$branches = array_map(fn($country) => preg_quote($country, '/'), $countries);
$result = [];
foreach ($data as $string) {
$result[] = preg_split('/ (?=(?:' . implode('|', $branches) . ')$)/', $string);
}
var_export($result);
Output:
array (
0 =>
array (
0 => 'Wellington',
1 => 'New Zealand',
),
1 =>
array (
0 => 'Florida',
1 => 'United States of America',
),
2 =>
array (
0 => 'Quebec',
1 => 'Canada',
),
3 =>
array (
0 => 'Something',
1 => 'Country XYZ (formally ABC)',
),
)
Note that if an item/row in the result array only has one element, then you know that the attempted split failed to match the country substring.
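That failure check could look roughly like this. It is a sketch under assumptions: the 'Atlantis' row is made up to force a failed split, and storing null for the country in that case is just one possible convention.

```php
<?php
// Illustrative data; 'Atlantis' is a made-up row with no known country.
$countries = ['United States of America', 'Canada', 'New Zealand'];
$data = ['Wellington New Zealand', 'Quebec Canada', 'Atlantis'];

$branches = array_map(fn($country) => preg_quote($country, '/'), $countries);
$pattern = '/ (?=(?:' . implode('|', $branches) . ')$)/';

$rows = [];
foreach ($data as $string) {
    $parts = preg_split($pattern, $string);
    if (count($parts) === 2) {
        [$state, $country] = $parts;
    } else {
        // The split found no country: keep the raw string, flag the country as unknown.
        [$state, $country] = [$string, null];
    }
    $rows[] = ['state' => $state, 'country' => $country];
}
var_export($rows);
```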
I use this same technique when splitting a street name from its street type, for cases like "First Street North" where the street type spans multiple words.
I have a php function that splits product names from their color name in woocommerce.
The full string is generally of this form "product name - product color", like for example:
"Boxer Welbar - ligth grey" splits into "Boxer Welbar" and "light grey"
"Longjohn Gari - marine stripe" splits into "Longjohn Gari" and "marine stripe"
But in some cases it can be "Tee-shirt - product color"...and in this case the split doesn't work as I want, because the "-" in Tee-shirt is detected.
How to circumvent this problem? Should I use a "lookahead" statement in the regexp?
function product_name_split($prod_name) {
$currenttitle = strip_tags($prod_name);
$splitted = preg_split("/–|[\p{Pd}\xAD]|(–)/", $currenttitle);
return $splitted;
}
I'd go for a negative lookahead.
Something like this:
-(?!.*-)
That means: match a - that is not followed by any other - later in the string.
This works only if the color name never contains a -.
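A short sketch of that lookahead used with preg_split(). It assumes a plain ASCII hyphen as the delimiter, and the trim step is an addition, since the pattern itself leaves the spaces around the hyphen in place.

```php
<?php
$title = 'Tee-shirt - product color';

// Split on the last hyphen only: the one with no further hyphen after it.
$parts = preg_split('/-(?!.*-)/', $title);

// The pattern does not consume the surrounding spaces, so trim each piece.
$parts = array_map('trim', $parts);
print_r($parts);
```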
What about counting space characters that surround a dash?
For example:
function product_name_split($prod_name) {
$currenttitle = strip_tags($prod_name);
$splitted = preg_split("/\s(–|[\p{Pd}\xAD]|(–))\s/", $currenttitle);
return $splitted;
}
This automatically trims spaces from split parts as well.
If you have " - " as the delimiter (note the spaces around the dash), you may simply use explode(...). If not, use
\s*-(?=[^-]+$)\s*
or
\w+-\w+(*SKIP)(*FAIL)|-
with preg_split(), see the demos on regex101.com (#2)
In PHP this could be:
<?php
$strings = ["Tee-shirt - product color", "Boxer Welbar - ligth grey", "Longjohn Gari - marine stripe"];
foreach ($strings as $string) {
print_r(explode(" - ", $string));
}
foreach ($strings as $string) {
print_r(preg_split("~\s*-(?=[^-]+$)\s*~", $string));
}
?>
Both approaches will yield
Array
(
[0] => Tee-shirt
[1] => product color
)
Array
(
[0] => Boxer Welbar
[1] => ligth grey
)
Array
(
[0] => Longjohn Gari
[1] => marine stripe
)
To collect the split items, use array_map(...):
$splitted = array_map( function($item) {return preg_split("~\s*-(?=[^-]+$)\s*~", $item); }, $strings);
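The second pattern, with (*SKIP)(*FAIL), is not exercised in the code above, so here is a hedged sketch of it. One change is an assumption on my part: the final alternative is written as \s*-\s* rather than a bare -, so that the surrounding spaces are consumed by the split.

```php
<?php
$strings = ['Tee-shirt - product color', 'Boxer Welbar - ligth grey'];

// First alternative: a hyphen glued between word characters is matched,
// then discarded via (*SKIP)(*FAIL). Second alternative: any remaining
// hyphen, together with the whitespace around it, becomes the delimiter.
$pattern = '~\w+-\w+(*SKIP)(*FAIL)|\s*-\s*~';

$split = array_map(fn($item) => preg_split($pattern, $item), $strings);
print_r($split);
```

Unlike the lookahead version, this one does not depend on the delimiter being the last hyphen, but it would still split a color name that contains a spaced hyphen.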
Your sample inputs convey that the neighboring whitespace around the delimiting hyphen/dash is just as critical as the hyphen/dash itself.
I recommend doing all of the html and special entity decoding before executing your regex -- that's what these other functions are built for and it will make your regex pattern much simpler to read and maintain.
\p{Pd} will match any hyphen/dash. Reinforce the business logic in the code by declaring a maximum of 2 elements to be generated by the split.
As a general rule, I discourage declaring single-use variables.
Code: (Demo)
function product_name_split($prod_name) {
return preg_split(
"/ \p{Pd} /u",
strip_tags(
html_entity_decode(
$prod_name
)
),
2
);
}
$tests = [
'Tee-shirt - product color',
'Boxer Welbar - ligth grey',
'Longjohn Gari - marine stripe',
'En dash – green',
'Entity &ndash; blue',
];
foreach ($tests as $test) {
echo var_export(product_name_split($test), true) . "\n";
}
Output:
array (
0 => 'Tee-shirt',
1 => 'product color',
)
array (
0 => 'Boxer Welbar',
1 => 'ligth grey',
)
array (
0 => 'Longjohn Gari',
1 => 'marine stripe',
)
array (
0 => 'En dash',
1 => 'green',
)
array (
0 => 'Entity',
1 => 'blue',
)
As usual, there are several options for this. Here is one of them:
explode — Split a string by a string
end — Set the internal pointer of an array to its last element
$currenttitle = 'Tee-shirt - product color';
$array = explode( '-', $currenttitle );
echo end( $array );
Using str_replace() to replace values in a couple paragraphs of text data, it seems to do so but in an odd order. The values to be replaced are in a hard-coded array while the replacements are in an array from a query provided by a custom function called DBConnect().
I used print_r() on both arrays to verify them: as far as I could tell, both have the same number of entries in the same order, yet the on-screen results are mismatched. I expected this to be straightforward and didn't think it needed a loop, since str_replace() usually handles arrays itself. Did I miss something?
$replace = array('[MyLocation]','[CustLocation]','[MilesInc]','[ExtraDoc]');
$replacements = DBConnect($sqlPrices,"select",$siteDB);
$PageText = str_replace($replace,$replacements,$PageText);
and $replacements is:
Array
(
[0] => 25
[MyLocation] => 25
[1] => 45
[CustLocation] => 45
[2] => 10
[MilesInc] => 10
[3] => 10
[ExtraDoc] => 10
)
Once I saw what the $replacements array actually looked like, I was able to fix it by filtering out the numeric keys.
$replace = array('[MyLocation]','[CustLocation]','[MilesInc]','[ExtraDoc]');
$replacements = DBConnect($sqlPrices,"select",$siteDB);
$newArray = []; // initialize before filling it in the loop
foreach ($replacements as $key=>$value) :
if (!is_numeric($key)) $newArray[$key] = $value;
endforeach;
$PageText = str_replace($replace,$newArray,$PageText);
The former $replacements array, filtered to $newArray, looks like this:
Array
(
[MyLocation] => 25
[CustLocation] => 45
[MilesInc] => 10
[ExtraDoc] => 10
)
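To illustrate why the unfiltered array produced mismatched output: str_replace() pairs the search and replacement arrays by position, not by key, so only the first four values of the eight-element array were used. The sketch below uses a hand-written stand-in for the DBConnect() result and a made-up $PageText. The duplicated numeric and named keys look like a fetch mode that returns both, but since DBConnect() is custom, that is a guess.

```php
<?php
// Stand-in for the DBConnect() result shown above (hypothetical literal).
$replacements = [
    0 => 25, 'MyLocation' => 25,
    1 => 45, 'CustLocation' => 45,
    2 => 10, 'MilesInc' => 10,
    3 => 10, 'ExtraDoc' => 10,
];
$replace = ['[MyLocation]', '[CustLocation]', '[MilesInc]', '[ExtraDoc]'];
$PageText = '[MyLocation] / [CustLocation] / [MilesInc] / [ExtraDoc]';

// Unfiltered: the first four values in iteration order are 25, 25, 45, 45.
$wrong = str_replace($replace, $replacements, $PageText);

// Keep only the string-keyed entries, preserving their order.
$named = array_filter($replacements, 'is_string', ARRAY_FILTER_USE_KEY);
$right = str_replace($replace, $named, $PageText);

echo $wrong, PHP_EOL, $right, PHP_EOL;
```

array_filter() with ARRAY_FILTER_USE_KEY does the same job as the foreach loop above in a single call.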
@DonP, what you are trying to do is possible.
In my opinion, the strtr() function could be more beneficial to you. All you need is to make a few adjustments in your code, like this ...
<?php
$replacements = DBConnect($sqlPrices,"select",$siteDB);
$PageText = strtr($PageText, [
'[MyLocation]' => $replacements['MyLocation'],
'[CustLocation]' => $replacements['CustLocation'],
'[MilesInc]' => $replacements['MilesInc'],
'[ExtraDoc]' => $replacements['ExtraDoc'],
]);
?>
This code is kinda verbose and requires writing repetitive strings. Once you understand the way it works, you can use some loops or array functions to refactor it. For example, you could use the following more compact version ...
<?php
// Reference fields.
$fields = ['MyLocation', 'CustLocation', 'MilesInc', 'ExtraDoc'];
// Creating the replacement pairs.
$replacementPairs = [];
foreach($fields as $field){
$replacementPairs["[{$field}]"] = $replacements[$field];
}
// Perform the replacements.
$PageText = strtr($PageText, $replacementPairs);
?>
I am parsing HTML strings to get values in PHP and write them in database. Here is an example string:
<b>Adress:</b> 22 Examplary road, Nowhere <br>
<b>Phone:</b> +371 12345678, +371 23456789<br>
<b>E-mail: </b>info#example.com<br>
The string can be formatted in random manners. It can contain additional keys that I am not parsing out and it can contain duplicate keys. It can also contain only some of the keys that I am interested in or be completely empty. HTML can also be broken (example tag: <br). I have decided that I will follow the rules that entries are separated by \n and are in the form key: value + some HTML.
First, I use this code to make the string parseable:
$parse = strip_tags($string);
$parse = str_replace(':', '=', $parse);
$parse = str_replace("\n", '&', $parse);
$parse = str_replace("\r", '', $parse);
$parse = str_replace("\t", '', $parse);
My string looks something like this now:
Adress= 22 Examplary road, Nowhere&Phone= +123 12345678, +123 23456789&E-mail= info#example.com
Then I use parse_str() to get the values and then I take out the values if the needed keys are found:
parse_str($parse, $values);
$address = null;
if (isset($values['Adress']))
$address = trim($values['Adress']);
$phone = null;
if (isset($values['Phone']))
$phone = trim($values['Phone']);
The problem is that I end up with $phone = '371 12345678, 371 23456789': the + signs are lost. How can I preserve them?
Also, if you have any hints how to improve this procedure, I would be glad to know that. Some entries have Website: example.com, others have Web Site example.com... I am pretty sure that it will not be possible to automatically parse all of the information but I am looking for the best possible solution.
Solution
Using tips provided by WEBjuju I am now using this:
preg_match_all('/([^:]*):\s?(.*)\n/Usi', $string, $matches, PREG_SET_ORDER);
$values = [];
foreach ($matches as $match)
{
$key = strip_tags($match[1]);
$key = trim($key);
$key = mb_strtolower($key);
$key = preg_replace('/\s+/', '', $key); // str_replace("\s", ...) would only look for a literal backslash-s
$key = str_replace('-', '', $key);
$value = strip_tags($match[2]);
$value = trim($value);
$values[$key] = $value;
}
This allows me to go from this input:
<b>Venue:</b> The Hall<br
<b>Adress:</b> 22 Examplary road, Nowhere <br>
<b>Phone:</b> +371 12345678<br>
<b>E-mail: </b>info#example.com<br>
<b>Website:</b> example.com<br>
To a nice PHP array with homogenized and hopefully recognizable keys:
[
'venue' => 'The Hall',
'adress' => '22 Examplary road, Nowhere',
'phone' => '+371 12345678',
'email' => 'info#example.com',
'website' => 'example.com',
];
It still doesn't account for the cases of missing colons, but I don't think I can solve that...
Since your HTML conforms to a simple, regular structure, regular expression matching is probably the best way to grab this data. Here is an example to get you on your way. I'm sure it doesn't solve everything, but it addresses the issue in this post: finding the key/value matches.
// now go get those matches!
preg_match_all('/<b>([^:]*):\s?<\/b>(.*)<br>/Usi', $string, $matches, PREG_SET_ORDER);
die('<pre>'.print_r($matches,true));
That will output, for instance, something like this:
Array
(
[0] => Array
(
[0] => <b>Adress:</b> 22 Examplary road, Nowhere <br>
[1] => Adress
[2] => 22 Examplary road, Nowhere
)
[1] => Array
(
[0] => <b>Phone:</b> +371 12345678, +371 23456789<br>
[1] => Phone
[2] => +371 12345678, +371 23456789
)
[2] => Array
(
[0] => <b>E-mail: </b>info#example.com<br>
[1] => E-mail
[2] => info#example.com
)
)
And from there, I'd have to guess that you can putt that in for par.
Use base64_encode() before you put your value in your string. In the code where you receive this string, use base64_decode() to get it back. Because Base64 output can itself contain + and /, also wrap it in urlencode(); parse_str() undoes that encoding automatically.
page1.php
$string = '&Adress='.urlencode(base64_encode('22 Examplary road, Nowhere')).'&Phone='.urlencode(base64_encode('+123 12345678, +123 23456789')).'&Email='.urlencode(base64_encode('info#example.com'));
// string is sent via curl or some other transport to page2.php
page2.php
parse_str($string, $values); // the second argument is required as of PHP 8
echo base64_decode($values['Adress']); // 22 Examplary road, Nowhere
echo base64_decode($values['Phone']); // +123 12345678, +123 23456789
echo base64_decode($values['Email']); // info#example.com
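For the original question, the + signs vanish because parse_str() URL-decodes its input, and a literal + decodes to a space. If the str_replace() pipeline from the question is kept, one possible fix is to percent-encode the special characters first. This is a sketch, not a full solution: the two-line input is a shortened sample, and encoding % and & as well is an extra precaution I have assumed is wanted.

```php
<?php
$string = "<b>Adress:</b> 22 Examplary road, Nowhere <br>\n"
        . "<b>Phone:</b> +371 12345678, +371 23456789<br>\n";

$parse = strip_tags($string);
$parse = str_replace(["\r", "\t"], '', $parse);
// Percent-encode characters that parse_str() would otherwise reinterpret.
// '%' goes first so the escapes added afterwards are not encoded twice.
$parse = str_replace(['%', '+', '&'], ['%25', '%2B', '%26'], $parse);
$parse = str_replace(':', '=', $parse);
$parse = str_replace("\n", '&', $parse);

parse_str($parse, $values);
$phone = isset($values['Phone']) ? trim($values['Phone']) : null;
echo $phone, PHP_EOL;
```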
I'm trying to match two types of strings using the preg_match function in PHP which could be the following.
'_mything_to_newthing'
'_onething'
'_mything_to_newthing_and_some_stuff'
In the third one above, I only want "mything" and "newthing"; everything after the third part is optional text the user could add. Ideally, the regex would produce the following for the cases above:
'mything', 'newthing'
'onething'
'mything', 'newthing'
The patterns should match a-zA-Z0-9 if possible :-)
My regex is terrible, so any help would be appreciated!
Thanks in advance.
Assuming you're talking about _ delimited text:
$regex = '/^_([a-zA-Z0-9]+)(|_to_([a-zA-Z0-9]+).*)$/';
$string = '_mything_to_newthing_and_some_stuff';
preg_match($regex, $string, $match);
$match = array(
0 => '_mything_to_newthing_and_some_stuff',
1 => 'mything',
2 => '_to_newthing_and_some_stuff',
3 => 'newthing',
);
As for anything further, please provide more details and better sample text/output.
Edit: You could always just use explode:
$parts = explode('_', $string);
$parts = array(
0 => '',
1 => 'mything',
2 => 'to',
3 => 'newthing',
4 => 'and',
5 => 'some',
6 => 'stuff',
);
As long as the format is consistent, it should work well...
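Returning to the regex approach, here is a sketch that yields exactly the lists the question asks for. It is a variation on the pattern above, not the same pattern: the optional part is a non-capturing group, so the result holds only the wanted pieces. The helper name extractThings() is hypothetical, and returning an empty array on no match is an assumption.

```php
<?php
// extractThings() is a hypothetical helper name used for illustration.
function extractThings(string $input): array
{
    // Group 1: the first alphanumeric part; group 2 (optional): the part after "_to_".
    $regex = '/^_([a-zA-Z0-9]+)(?:_to_([a-zA-Z0-9]+).*)?$/';
    if (!preg_match($regex, $input, $m)) {
        return []; // assumption: no match yields an empty list
    }
    return array_slice($m, 1); // drop the full-match entry
}

foreach (['_mything_to_newthing', '_onething', '_mything_to_newthing_and_some_stuff'] as $s) {
    echo implode(', ', extractThings($s)), PHP_EOL;
}
```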