Parsing plain text that contains custom conditionals

I presume this is a weird sort of thing that I'm looking for.
I have the following text string:
$string = "The compass is pointing <north|south|east|west> towards <London|Paris|Rome>";
Somehow I want to parse this to obtain any of the following outputs:
The compass is pointing north towards Paris
The compass is pointing south towards London
The compass is pointing east towards Rome
The compass is pointing east towards London
Etc.
For each set of < > in the text string I need to convert the contents into an array (using explode("|",$string)?), then run array_rand on that array to get the key for the option we will display, then just read the array and return that value.
The problem is, I have next to no experience with text parsing, but I'd guess you'd use preg_replace for this type of problem.
I'd appreciate if anyone could help me get started.

You could use preg_replace_callback() to execute a function that chooses a random replacement.
$string = "The compass is pointing <north|south|east|west> towards <London|Paris|Rome>";
$callback = function ($match) {
    $opts = explode('|', $match[1]);
    return $opts[array_rand($opts)];
};
echo preg_replace_callback('/<(.+?)>/', $callback, $string);
The pattern matches <, any stuff (.+), and >. The "lazy" quantifier ? makes the + stop when it finds the shortest match, rather than being "greedy" and looking for the longest possible match (which is default behavior). Without it, it would match all the way to the last >, which is too far.
The ( ) creates a subpattern, so while $match[0] would be what's matched by the entire pattern (including < >), $match[1] will only contain the subpattern (without < >).
The callback function is called each time a match is found, and it does exactly what you described -- explode() the list of options and return a random one. The return value then replaces the original match.
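To see the effect of the lazy quantifier described above, here is a small sketch that runs the greedy and the lazy pattern against the same string. The variable names $greedy and $lazy are only illustrative.

```php
<?php
$string = "The compass is pointing <north|south|east|west> towards <London|Paris|Rome>";

// Greedy: .+ runs on to the LAST '>', so both groups end up in one match.
preg_match('/<(.+)>/', $string, $greedy);

// Lazy: .+? stops at the FIRST '>', so each <...> group is matched on its own.
preg_match_all('/<(.+?)>/', $string, $lazy);

echo $greedy[1], "\n";
print_r($lazy[1]);
```

The greedy version captures everything between the first < and the last >, while the lazy version yields one option list per group, which is what the callback needs.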

Related

Selecting thousands separator character with RegEx

I need to change the decimal separator in a given string that has numbers in it.
What RegEx code can ONLY select the thousands separator character in the string?
It needs to select the character only when there are digits around it. For example, only in 123,456 do I need to select and replace the ,
I'm converting English numbers into Persian (e.g. Hello 123 becomes Hello ۱۲۳). Now I need to replace the decimal separator with the Persian version too, but I don't know how to select it with a regex. E.g. Hello 121,534 must become Hello ۱۲۱/۵۳۴
The character that needs to be replaced is , with /

Use a regular expression with lookarounds.
$new_string = preg_replace('/(?<=\d),(?=\d)/', '/', $string);
(?<=\d) means there has to be a digit before the comma, (?=\d) means there has to be a digit after it. But since these are lookarounds, they're not included in the match, so they don't get replaced.
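As a rough sketch of how this fits with the digit conversion mentioned in the question: replace the separators first, while the digits are still ASCII (so \d can see them), then map the digits. The $persianDigits table and the variable names are assumptions made for this example.

```php
<?php
$string = 'Hello, world: 1,234,567 and 121,534';

// Step 1: replace only commas that sit between two digits.
$separated = preg_replace('/(?<=\d),(?=\d)/', '/', $string);

// Step 2: map ASCII digits to Persian digits (illustrative lookup table).
$persianDigits = ['۰', '۱', '۲', '۳', '۴', '۵', '۶', '۷', '۸', '۹'];
$converted = str_replace(str_split('0123456789'), $persianDigits, $separated);

echo $separated, "\n";
echo $converted, "\n";
```

The comma after "Hello" is left alone because it has no digit on either side.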

According to your question, the main problem you face is to convert the English number into the Persian.
In PHP there is a library available that can format and parse numbers according to the locale, you can find it in the class NumberFormatter which makes use of the Unicode Common Locale Data Repository (CLDR) to handle - in the end - all languages known to the world.
So converting the number 123,456 from en_GB (or en_US) to fa_IR is shown in this little example:
$string = '123,456';
$float = (new NumberFormatter('en_GB', NumberFormatter::DECIMAL))->parse($string);
var_dump(
    (new NumberFormatter('fa_IR', NumberFormatter::DECIMAL))->format($float)
);
Output:
string(14) "۱۲۳٬۴۵۶"
This shows, roughly, how to convert the number. I'm not familiar with Persian, so please excuse me if I used the wrong locale here. There may also be options to specify which character to use for grouping, but for this example the point is that number conversion is already handled by existing libraries. You don't need to reinvent this; in fact "reinvent" undersells it, since this isn't something a single person could reasonably do alone.
So after clarifying how to convert these numbers, the question remains how to do that for the whole text. Why not locate all the potential number candidates, try to parse each match, and convert it to the other locale if (and only if) parsing succeeds?
Luckily the NumberFormatter::parse() method returns false if parsing did fail (there is even more error reporting in case you're interested in more details) so this is workable.
For regular expression matching it only needs a pattern which matches a number (largest match wins) and the replacement can be done by callback. In the following example the translation is done verbose so the actual parsing and formatting is more visible:
# some text
$buffer = <<<TEXT
it need to only select , when there is number around it. for example only
when 123,456 i need to select and replace "," I'm converting English
numbers into Persian (e.g: "Hello 123" becomes "Hello ۱۲۳"). now I need to
replace the Decimal separator with Persian version too. but I don't know how
I can select it with regex. e.g: "Hello 121,534" most become
"Hello ۱۲۱/۵۳۴" The character that needs to be replaced is , with /
TEXT;
# prepare formatters
$inFormat = new NumberFormatter('en_GB', NumberFormatter::DECIMAL);
$outFormat = new NumberFormatter('fa_IR', NumberFormatter::DECIMAL);
$bufferWithFarsiNumbers = preg_replace_callback(
    '(\b[1-9]\d{0,2}(?:[ ,.]\d{3})*\b)u',
    function (array $matches) use ($inFormat, $outFormat) {
        [$number] = $matches;
        $result = $inFormat->parse($number);
        # leave the text untouched if it does not parse as a number
        if (false === $result) {
            return $number;
        }
        return sprintf("< %s (%.4f) = %s >", $number, $result, $outFormat->format($result));
    },
    $buffer
);
echo $bufferWithFarsiNumbers;
Output:
it need to only select , when there is number around it. for example only
when < 123,456 (123456.0000) = ۱۲۳٬۴۵۶ > i need to select and replace "," I'm converting English
numbers into Persian (e.g: "Hello < 123 (123.0000) = ۱۲۳ >" becomes "Hello ۱۲۳"). now I need to
replace the Decimal separator with Persian version too. but I don't know how
I can select it with regex. e.g: "Hello < 121,534 (121534.0000) = ۱۲۱٬۵۳۴ >" most become
"Hello ۱۲۱/۵۳۴" The character that needs to be replaced is , with /
Here the magic is just to bring the string parts together with the number conversion by using preg_replace_callback with a regular expression pattern that should match the needs in your question, but that is relatively easy to refine, since you define the whole number part and false positives are filtered out by the NumberFormatter class:
                pattern for Unicode UTF-8 strings
                                 |
(\b[1-9]\d{0,2}(?:[ ,.]\d{3})*\b)u
 |                  |         |
 |         grouping character |
 |                            |
word boundary ----------------+
Edit:
To only match the same grouping character over multiple thousand blocks, a named group can be created and back-referenced for the repetition:
(\b[1-9]\d{0,2}(?:(?<grouping_char>[ ,.])\d{3}(?:\k<grouping_char>\d{3})*)?\b)u
(This is now harder to read.)
To finalize the answer, only the return clause needs to be condensed to return $outFormat->format($result); and the $outFormat NumberFormatter might need some more configuration but as it is available in the closure, this can be done when it is created.
I hope this is helpful and opens up a broader picture, rather than looking for a solution only at the point where you hit a wall. Regex alone is often not the answer. I'm pretty sure there are regex experts who can give you a fairly stable one-liner, but the context in which it is used will not be very stable. That is not to say there is only one answer. Instead, combining different levels of work (divide and conquer) lets you rely on a stable number conversion even if you are still unsure how to write a regex pattern for an English number.

You can write a regex to capture numbers with a thousands separator, and then join the two numeric parts with the separator you want:
$text = "Hello, world, 121,534";
$pattern = "/([0-9]{1,3}),([0-9]{3})/";
$new_text = preg_replace($pattern, "$1X$2", $text); // replace the comma with 'X', keep the captured groups intact.
echo $new_text; // Hello, world, 121X534
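One thing worth checking with the capture-group pattern is a number with more than one separator. Matches cannot overlap, so the digits consumed by the first match are no longer available as the left-hand side of the next one. The sketch below compares it with a lookaround variant; the sample text and variable names are made up for illustration.

```php
<?php
$text = 'Total: 1,234,567';

// Capture groups consume "1,234", so ",567" has no digits left in front of it.
$captured = preg_replace('/([0-9]{1,3}),([0-9]{3})/', '$1X$2', $text);

// Lookarounds consume only the comma itself, so every separator is replaced.
$lookaround = preg_replace('/(?<=[0-9]),(?=[0-9]{3})/', 'X', $text);

echo $captured, "\n";
echo $lookaround, "\n";
```

If your numbers never exceed six digits the capture-group version is enough; otherwise the lookaround form is the safer choice.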

In PHP you can do that using str_replace
$a="Hello 123,456";
echo str_replace(",", "X", $a);
This will return: Hello 123X456

Problems with Router / URL Matcher class

Since regular expressions aren't really my specialty, I need help with this little problem (in PHP).
I want to match a given url with an array of defined routes, e.g.:
$definedRoute = '/admin/user/[:id]/edit';
$url = '/admin/user/37/edit';
In my class, it would be like this, I imagine (getRoutes() returns an array of defined routes):
foreach ($this->getRoutes() as $route) {
    $pattern = '~' . preg_replace('~\[\:[a-z]+\]~', '[a-z0-9]+',
        str_replace('/', '\/', $route['definition'])) . '~';
    if (preg_match($pattern, $url)) {
        $parameters = $this->getRouteParameters($route['definition']);
        (new $route['class']())->{$route['method']}($parameters);
        // die? break?
    }
}
I went about it like this: replace every occurrence of a named parameter like [:id] with a regex for lowercase letters and numbers, e.g. [a-z0-9]+.
This would actually work, but in some cases it matches multiple routes and therefore the wrong ones. Also, ~\/~ matches in most cases. But every URL should only be matched once.
Edit #1: the problem is: routes get matched multiple times. How can I prevent this?
Can someone enlighten me?

I don't know if this will cover every conceivable case, but you can use preg_match_all or preg_match on one combined pattern instead of iterating over the patterns. It should also improve performance.
This makes the match order (left to right) important, which is awkward to do with an array and a loop (possible, but uglier). Then we can sort the routes by complexity, like this:
//this is intentionally in the opposite order of what I want it.
$routes = ['definition' => ['\/', '\/admin\/user\/[a-z0-9]+\/edit']];
//the more / separators the closer to the beginning we want it. or the more complex regexs go first.
uasort($routes['definition'], function($a,$b){
    //count the number of / in the route
    //note the <=> spaceship (as it's called) is only available in PHP7+
    return substr_count($b, '/') <=> substr_count($a, '/');
});
$url = '/admin/user/37/edit';
//in regex the pipe | is OR
preg_match_all('~^('.implode('|', $routes['definition']).')~i', $url, $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => /admin/user/37/edit
)
[1] => Array
(
[0] => /admin/user/37/edit
)
)
Which is correct in this case. Even if you do get multiple matches, you can compare their lengths with strlen and take the longest, or "best", match. That is pretty simple, probably with a sort by length, so I will leave it up to you.
However I wouldn't call this a guarantee of it working 100% of the time, it's just the first thing that came to me.
Another Idea
Another idea is you are not anchoring the match to the start and end of the string. In theory the route could/should match the entire string so with my above example if you add ^ and $ here:
preg_match('~^('.implode('|', $routes['definition']).')$~i', $url, $matches);
This will ensure a full match, and in this case ~\/~ will not match even if the array is not sorted.
That said, it's not inconceivable that you would only have or need a partial match. This is up to you and how you build your routes and URLs. You can of course use only the ^ start anchor for a "begins with" type of match, but you would need to sort the routes in that case.
Preg Match vs Preg Match All
preg_match will also work, but it only returns the first match. So if the pattern matches multiple times, you cannot compare the matches to find the best one. This may be fine if you use ^ and $.
Hope it helps.
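The anchoring idea can be sketched against the [:id] style of definition from the question. This is only one possible shape, not the asker's actual router: routeToPattern() and matchRoute() are hypothetical helpers, and the [a-z0-9]+ parameter class is taken from the question.

```php
<?php
// Hypothetical helper: build an anchored regex from a route definition,
// turning each [:name] placeholder into a named capture group.
function routeToPattern(string $definition): string
{
    // With DELIM_CAPTURE, even indexes are literal text, odd indexes are placeholder names.
    $parts = preg_split('~\[:([a-z]+)\]~', $definition, -1, PREG_SPLIT_DELIM_CAPTURE);
    $regex = '';
    foreach ($parts as $i => $part) {
        $regex .= ($i % 2 === 0)
            ? preg_quote($part, '~')
            : '(?<' . $part . '>[a-z0-9]+)';
    }
    // ^ and $ force the whole URL to match, so '/' cannot match a longer URL.
    return '~^' . $regex . '$~i';
}

// Hypothetical helper: return the first definition that matches, plus its parameters.
function matchRoute(array $definitions, string $url): ?array
{
    foreach ($definitions as $definition) {
        if (preg_match(routeToPattern($definition), $url, $m)) {
            // Keep only the named groups (string keys), e.g. ['id' => '37'].
            $params = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
            return ['definition' => $definition, 'params' => $params];
        }
    }
    return null;
}

$match = matchRoute(['/', '/admin/user/[:id]/edit'], '/admin/user/37/edit');
print_r($match);
```

Because each pattern is anchored at both ends, the order of the definitions no longer matters for this example, and returning on the first hit answers the "die? break?" question in the original loop.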

PHP finding "words" within a string

I need to compare 2 lists of strings against each other and output strings which contain the strings searched for. should be very easy, i just can't figure it out.
to overly simplify it, let's use arrays. I am accessing an API with SOAP and running it against my own list contained in a table, but.... let's use arrays. the comparison is what i'm having trouble with.
hit submit button on listsearch.php and it executes.
ARRAY Mylist : TED, DEAD, FIRST, LAST, PUPPY
ARRAY TheirList: teddybearnoose, hauntedhouse, hehasdeparted, deadmouse, walkingdead, thegratefuldead, firstkiss, thinkfirst, firsttobelast, firstmanonthemoon, firstreattempted, somecrap, something, notdisplayed, 50000otherwords, miscjunk
outputs as:
TEDdybearnoose<br>
haunTEDhouse<br>
hehasdeparTED<br>
DEADmouse<br>
walkingDEAD<br>
thegratefulDEAD<br>
FIRSTkiss<br>
thinkFIRST<br>
FIRSTtobeLAST <--- note<br>
FIRSTmanonthemoon<br>
FIRSTreattempTED <--- note<br>
<br>
only outputs strings which contain a string in my list, in any position. CAPS is just to make the words stand out to you. not important.
now, part 2?
same "TheirList", except i type a keyword into a text area, and select whether i want it at the beginning end or anywhere from a dropdown.
keywordsearch.php
search for: [ TED ] at: [beginning / end / anywhere] of string.
how would you make that one work?
Thanks in advance. This should be a breeze for most of you. I appreciate it. i'll try to answer questions promptly

You can use strpos() to find the position of a substring (docs).
It makes it very easy to check whether the substring occurred at the beginning or at the end of the string:
// String contains substring
strpos($string, $substring) !== false;
// String starts with substring
strpos($string, $substring) === 0;
// String ends with substring (strrpos finds the LAST occurrence, so an earlier one does not hide it)
strrpos($string, $substring) === strlen($string) - strlen($substring);
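Putting the checks above together for both parts of the question might look like the sketch below. containsAt() is a hypothetical helper name, and the case-insensitive stripos()/strripos() variants are used on the assumption that the uppercase keywords should match the lowercase list entries. It also assumes the keyword is not empty.

```php
<?php
// Hypothetical helper: does $haystack contain $needle at the given position?
// $where is one of 'beginning', 'end', 'anywhere'.
function containsAt(string $haystack, string $needle, string $where = 'anywhere'): bool
{
    $first = stripos($haystack, $needle);
    if ($first === false) {
        return false;
    }
    switch ($where) {
        case 'beginning':
            return $first === 0;
        case 'end':
            // Use the LAST occurrence so an earlier hit does not mask the one at the end.
            return strripos($haystack, $needle) === strlen($haystack) - strlen($needle);
        default:
            return true;
    }
}

// Part 1: keep every entry that contains any word from my list.
$mine   = ['TED', 'DEAD', 'FIRST', 'LAST', 'PUPPY'];
$theirs = ['teddybearnoose', 'somecrap', 'walkingdead', 'firsttobelast'];
$hits = array_values(array_filter($theirs, function (string $entry) use ($mine): bool {
    foreach ($mine as $word) {
        if (containsAt($entry, $word)) {
            return true;
        }
    }
    return false;
}));
print_r($hits);
```

For part 2, the value of the dropdown can be passed straight through as $where.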

How to find if two characters are in an array php

I am looking to develop a search function that allows users to just search for the item, or modify their search with a price range in brackets. So that is to say if they are looking for a car, then they can enter either car and receive all cars in the database or they can enter car (100, 299) or car(100, 299) and receive only cars in the database with the price range of 100 to 299.
Previously I used three different explode calls, but that was cumbersome and looked ridiculously ugly. I also tried to put the brackets in an array and then compare that against the word searched (a word is basically an array of characters), but that didn't work. Finally, I have been reading up on strpos and substr, but they don't seem to fit the requirements, as strpos returns the first occurrence of the character and substr returns the characters within a specified length after a specific occurrence.
So for example the problem with strpos is the user can just enter ( and no ) bracket and I'll make a call to my search function with who knows what. And for example the problem with substr is that the price range can vary wildly.

You can use preg_match to parse the search string - I'm assuming that's the part you're having issues with.
if (preg_match('/car ?\(([^,]+), ?([^\)]+)\)/', $search_text, $matches)) {
$low_price = $matches[1];
$high_price = $matches[2];
//do your price filtering here
}
The regular expression may need a little tweaking. Parentheses don't need to be escaped inside a character class, but the escape in [^\)] is harmless.

Yes, Sam is right. You should do this with regular expressions.
Look up preg_match() in the documentation.
To complete his answer, the regular expression for your case is:
$regex = '/^([a-zA-Z]+)\s*\(([0-9]+),\s*([0-9]+)\)$/';
if (preg_match($regex, $search_text, $matches)) {
    $type = $matches[1];
    $low_price = $matches[2];
    $high_price = $matches[3];
    //do your price filtering here
}
Be careful: $matches[0] holds the entire matched string, so the captured groups start at index 1, not 0.
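Since the question also allows a plain search such as car with no price range, one way to cover both forms is to make the bracketed part optional. The following is a sketch only: parseSearch() and the returned array keys are names invented for this example.

```php
<?php
// Hypothetical helper: parse "car", "car (100, 299)" or "car(100,299)".
// Returns null when the input is malformed, e.g. an unclosed bracket.
function parseSearch(string $query): ?array
{
    $pattern = '/^\s*([a-z]+)\s*(?:\(\s*(\d+)\s*,\s*(\d+)\s*\))?\s*$/i';
    if (!preg_match($pattern, $query, $m)) {
        return null;
    }
    return [
        'item' => $m[1],
        // Groups 2 and 3 are absent from $m when no price range was given.
        'low'  => isset($m[2]) ? (int) $m[2] : null,
        'high' => isset($m[3]) ? (int) $m[3] : null,
    ];
}

var_dump(parseSearch('car'));
var_dump(parseSearch('car (100, 299)'));
var_dump(parseSearch('car (100'));
```

Anchoring with ^ and $ is what rejects input like car (100, so the search function is never called with a half-typed range.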

preg_replace or regex string translation

I found some partial help but cannot seem to fully accomplish what I need. I need to be able to do the following:
I need a regular expression to replace any 1 to 3 character words between two words that are longer than 3 characters with a match-anything expression:
For example:
walk to the beach ==> walk(.*)beach
If the 1 to 3 character word is not preceded by a word that's longer than 3 characters then I want to translate that 1 to 3 letter word to '<word> ?'
For example:
on the beach ==> on ?the ?beach
The simpler the rule the better (of course, if there's a more complicated alternative that performs better, I'll take that as well, as I anticipate heavy usage eventually).
This will be used in a PHP context most likely with preg_replace. Thus, if you can put it in that context then even better!
By the way so far I have got the following:
$string = preg_replace('/\s+/', '(.*)', $string);
$string = preg_replace('/\b(\w{1,3})(\.*)\b/', '${1} ?', $string);
but that results in:
walk to the beach ==> 'walk(.*)to ?beach'
which is not what I want. 'on the beach' seems to translate correctly.

I think you will need two replacements for that. Let's start with the first requirement:
$str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
Of course, you need to replace those \w (which match letters, digits and underscores) with a character class of what you actually want to treat as a word character.
The second one is a bit tougher, because matches cannot overlap and lookbehinds cannot be of variable length. So we have to run this multiple times in a loop:
do
{
    $str = preg_replace('/^\w{0,3}(?: \w{0,3})* (?!\?)/', '$0?', $str, -1, $count);
} while($count);
Here we match everything from the beginning of the string, as long as it's only up-to-3-letter words separated by spaces, plus one trailing space (only if it is not already followed by a ?). Then we put all of that back in place, and append a ?.
Update:
After all the talk in the comments, here is an updated solution.
After running the first line, we can assume that the only up-to-3-letter words left will be at the beginning or at the end of the string. All others will have been collapsed to (.*). Since you want to follow every space between those with ?, you do not even need a loop (in fact these are the only spaces left):
$str = preg_replace('/ /', ' ?', $str);
(Do this right after my first line of code.)
This would give the following two results (in combination with the first line):
let us walk on the beach now go => let ?us ?walk(.*)beach ?now ?go
let us walk on the beach there now go => let ?us ?walk(.*)beach(.*)there ?now ?go
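The two steps of the updated solution can be wrapped into one function, as sketched below. toLoosePattern() is a hypothetical name, and str_replace() stands in for the second preg_replace since only a literal space is being replaced.

```php
<?php
// Hypothetical wrapper around the two replacements from the updated solution.
function toLoosePattern(string $phrase): string
{
    // Step 1: collapse short words between two long words into (.*).
    $collapsed = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $phrase);
    // Step 2: every remaining space becomes an optional space.
    return str_replace(' ', ' ?', $collapsed);
}

echo toLoosePattern('walk to the beach'), "\n";
echo toLoosePattern('on the beach'), "\n";
echo toLoosePattern('let us walk on the beach there now go'), "\n";
```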
