Extract values from string divided by pipes not in regular order - php

I'm importing part of an XML file whose values are not in a regular order. For example, it could be:
87||1|#88||2|#89||2|50198#41||1|#3||1|117#20|||#6|20|1|#24|||78#145|5||#36|||90#37|||96#29|||67#26|||#27|||#25|||
I created a function like this:
function caratteristiche1($title) {
$title=substr($title,11,1);
return $title;
}
to get the value for #88, but sometimes #88 is not in that position, and other times #88 is not present at all.
I would like a function that searches for #88 and returns the value that follows the two subsequent pipes.
What can I do?
Thank you so much!

If the distance between the #88 and the two pipes is identical in all cases, you can use a Regular Expression with a "positive lookbehind assertion", see http://php.net/manual/en/regexp.reference.assertions.php
If not, you probably need to use plain PHP to locate the #88, then take the remaining substring from there on and search for the two pipes.
EDIT after your answer of "NO":
In this case, I wouldn't do it with RegEx.
Try something like this (the syntax might not be 100% accurate):
$pos88 = strpos($string, '#88');
if ($pos88 !== false) {
    $newStr = substr($string, $pos88);
    $posPipes = strpos($newStr, '||');
    // ... and so forth ...
}
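One way to flesh out the idea above without relying on fixed positions is to split the string into records and fields. This is only a sketch: it assumes records are separated by # and that each record has the form id|a|b|c, so the value "after the two subsequent pipes" is the third field. The function name valueForId is made up for illustration.

```php
<?php
// Hypothetical helper: returns the third field of the record whose id
// matches $id, or null when no such record exists.
function valueForId(string $data, string $id): ?string
{
    foreach (explode('#', $data) as $record) {
        $fields = explode('|', $record);
        // Compare the whole id so that "88" does not match "880".
        if ($fields[0] === $id) {
            return $fields[2] ?? null;
        }
    }
    return null;
}

$row = '87||1|#88||2|#89||2|50198#41||1|#3||1|117';
var_dump(valueForId($row, '88'));  // string(1) "2"
var_dump(valueForId($row, '999')); // NULL
```

Returning null for a missing id covers the case where #88 is not present at all.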

Related

Selecting thousands separator character with RegEx

I need to change the thousands separator in a given string that contains numbers.
What regex selects ONLY the thousands separator character in the string?
It should match only when there are digits around it. For example, in 123,456 I need to select and replace the ,
I'm converting English numbers into Persian (e.g. Hello 123 becomes Hello ۱۲۳). Now I need to replace the thousands separator with the Persian version too, but I don't know how to select it with a regex. E.g. Hello 121,534 must become Hello ۱۲۱/۵۳۴
The character that needs to be replaced is , with /
Use a regular expression with lookarounds.
$new_string = preg_replace('/(?<=\d),(?=\d)/', '/', $string);
(?<=\d) means there has to be a digit before the comma, (?=\d) means there has to be a digit after it. But since these are lookarounds, they're not included in the match, so they don't get replaced.
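As a rough illustration of how the lookaround replacement could be combined with the digit conversion the question mentions, here is a sketch. The function name toPersianNumbers and the digit map are assumptions added for this example; they are not part of the answer above.

```php
<?php
// Hypothetical helper: swaps the thousands separator for "/" and then
// maps ASCII digits to Persian digits (U+06F0 to U+06F9).
function toPersianNumbers(string $text): string
{
    // Only commas with a digit on both sides are replaced.
    $text = preg_replace('/(?<=\d),(?=\d)/', '/', $text);

    $digits = [
        '0' => '۰', '1' => '۱', '2' => '۲', '3' => '۳', '4' => '۴',
        '5' => '۵', '6' => '۶', '7' => '۷', '8' => '۸', '9' => '۹',
    ];
    return strtr($text, $digits);
}

echo toPersianNumbers('Hello 121,534, world'), "\n";
```

The comma after 534 is left alone because it is followed by a space rather than a digit.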
According to your question, the main problem you face is converting English numbers into Persian ones.
PHP has a library that formats and parses numbers according to the locale: the NumberFormatter class (part of the intl extension), which uses the Unicode Common Locale Data Repository (CLDR) to handle, in the end, all languages known to the world.
So converting a number 123,456 from en_UK (or en_US) to fa_IR is shown in this little example:
$string = '123,456';
$float = (new NumberFormatter('en_UK', NumberFormatter::DECIMAL))->parse($string);
var_dump(
(new NumberFormatter('fa_IR', NumberFormatter::DECIMAL))->format($float)
);
Output:
string(14) "۱۲۳٬۴۵۶"
(play with it on 3v4l.org)
This shows, roughly, how to convert the number. I'm not familiar with Persian, so please excuse me if I used the wrong locale here. There may also be options to choose which character is used for grouping, but for this example the point is just that number conversion is already handled by existing libraries. You don't need to reinvent it, and it would be a huge amount of work for a single person to do alone.
With the number conversion clarified, the question remains how to apply it to the whole text. Why not locate all the potential numbers, try to parse each match and, only if parsing succeeds, convert it to the other locale?
Luckily, the NumberFormatter::parse() method returns false if parsing fails (more detailed error reporting is available if you're interested), so this is workable.
For the regular expression matching, all that is needed is a pattern that matches a number (the largest match wins); the replacement can then be done by a callback. In the following example the translation is deliberately verbose so that the actual parsing and formatting are more visible:
# some text
$buffer = <<<TEXT
it need to only select , when there is number around it. for example only
when 123,456 i need to select and replace "," I'm converting English
numbers into Persian (e.g: "Hello 123" becomes "Hello ۱۲۳"). now I need to
replace the Decimal separator with Persian version too. but I don't know how
I can select it with regex. e.g: "Hello 121,534" most become
"Hello ۱۲۱/۵۳۴" The character that needs to be replaced is , with /
TEXT;
# prepare formatters
$inFormat = new NumberFormatter('en_UK', NumberFormatter::DECIMAL);
$outFormat = new NumberFormatter('fa_IR', NumberFormatter::DECIMAL);
$bufferWithFarsiNumbers = preg_replace_callback(
'(\b[1-9]\d{0,2}(?:[ ,.]\d{3})*\b)u',
function (array $matches) use ($inFormat, $outFormat) {
[$number] = $matches;
$result = $inFormat->parse($number);
if (false === $result) {
return $number;
}
return sprintf("< %s (%.4f) = %s >", $number, $result, $outFormat->format($result));
},
$buffer
);
echo $bufferWithFarsiNumbers;
Output:
it need to only select , when there is number around it. for example only
when < 123,456 (123456.0000) = ۱۲۳٬۴۵۶ > i need to select and replace "," I'm converting English
numbers into Persian (e.g: "Hello < 123 (123.0000) = ۱۲۳ >" becomes "Hello ۱۲۳"). now I need to
replace the Decimal separator with Persian version too. but I don't know how
I can select it with regex. e.g: "Hello < 121,534 (121534.0000) = ۱۲۱٬۵۳۴ >" most become
"Hello ۱۲۱/۵۳۴" The character that needs to be replaced is , with /
Here the magic is just to combine the string handling with the number conversion by using preg_replace_callback with a regular expression pattern. The pattern should match the needs in your question, but it is relatively easy to refine, since you define the whole number part and false positives are filtered out by the NumberFormatter class:
 pattern for Unicode UTF-8 strings
                                 |
(\b[1-9]\d{0,2}(?:[ ,.]\d{3})*\b)u
 |                  |         |
 |         grouping character |
 |                            |
word boundary ----------------+
(play with it on regex101.com)
Edit:
To match only the same grouping character across multiple thousands blocks, capture it in a named group and use a backreference to it for the repetition:
(\b[1-9]\d{0,2}(?:(?<grouping_char>[ ,.])\d{3}(?:\k<grouping_char>\d{3})*)?\b)u
(this is getting less easy to read; decipher it and play with it on regex101.com)
To finalize the answer, only the return statement needs to be condensed to return $outFormat->format($result);. The $outFormat NumberFormatter might need some more configuration, but since it is created outside the closure and passed in, this can be done when it is created.
(play with it on 3v4l.org)
I hope this is helpful and opens up a broader picture: don't look for a solution only at the spot where you hit a wall. Regex alone is most often not the answer. I'm pretty sure there are regex experts who can give you a fairly stable one-liner, but the context in which it is used will not be very stable. That's not to say there is only one answer. Instead, combining different layers of work (divide and conquer) lets you rely on a stable number conversion even if you are still unsure how to write a regex pattern for an English number.
You can write a regex to capture numbers with a thousands separator, and then join the two numeric parts with the separator you want:
$text = "Hello, world, 121,534" ;
$pattern = "/([0-9]{1,3}),([0-9]{3})/" ;
$new_text = preg_replace($pattern, "$1X$2", $text); // replace the comma with 'X', keep the other groups intact.
echo $new_text ; // Hello, world, 121X534
In PHP you can also do that using str_replace, but note that it replaces every comma in the string, not just those between digits:
$a="Hello 123,456";
echo str_replace(",", "X", $a);
This will return: Hello 123X456

How to effectively match a string with lots of regular expressions

I want to be able to effectively match a string with a number of regular expressions to determine what this string represents.
^[0-9]{1}$ if string matches it is of type 1
^[a-x]{300}$ if string matches it is of type 2
... ...
Iterating over a collection containing all of the regular expressions every time I want to match a string is way too heavy for me.
Is there any more effective way? Maybe I can compile these regexps into one big one? Maybe something that works like Google Suggestions, analysing letter after letter?
In my project, I am using PHP/MySQL, however I will be thankful for a clue in any language.
Edit:
Operation of matching a string will be very frequent and string values will vary.
What you could do, if possible, is group your regexes together and determine which group a string belongs to.
For instance, if a string doesn't match \d, you know there is no digit in it and you can skip all regexes that require one. So instead of matching against 300+ regexes, you might narrow that down to just 25.
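The grouping idea can be sketched as follows. The bucket names, the rule set and the function name classifyString are all hypothetical; the cheap ctype_* checks stand in for whatever pre-test fits your real patterns.

```php
<?php
// Hypothetical rule set: bucket name => [type number => pattern].
$buckets = [
    'digits'  => [1 => '/^[0-9]$/'],
    'letters' => [2 => '/^[a-x]{300}$/', 3 => '/^[x-z]{1,5}$/'],
];

// Picks a bucket with a cheap test, then runs only that bucket's regexes.
function classifyString(string $input, array $buckets): ?int
{
    if (ctype_digit($input)) {
        $bucket = 'digits';
    } elseif (ctype_alpha($input)) {
        $bucket = 'letters';
    } else {
        return null; // no bucket can match mixed or empty input
    }

    foreach ($buckets[$bucket] as $type => $pattern) {
        if (preg_match($pattern, $input) === 1) {
            return $type;
        }
    }
    return null;
}

var_dump(classifyString('7', $buckets));   // int(1)
var_dump(classifyString('xyz', $buckets)); // int(3)
```

Only the patterns in the selected bucket are evaluated, so the cost per string depends on the bucket size rather than on the total number of patterns.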
You can combine your regexes like this:
^(?:([0-9])|([a-x]{300}))$
Later, if you get more regexes, you can do this:
^(?:([0-9])|([a-x]{300})|([x-z]{1,5})|([ab]{2,}))$...
Then use this code:
$input=...
if (preg_match('#^(?:([0-9])|([a-x]{300}))$#', $input, $matches)) {
    if (($matches[1] ?? '') !== '') {
        // type 1
    } else if (($matches[2] ?? '') !== '') {
        // type 2
    }
    // and so on...
}
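A runnable variant of the single combined pattern might look like the sketch below. It uses named groups so the matching alternative can be read off by name; the group names type1 to type3 and the function name classify are assumptions made for this example. The alternatives are wrapped in ^(?:...)$ so that both anchors apply to every branch.

```php
<?php
// Hypothetical classifier: one combined pattern, one named group per type.
function classify(string $input): ?string
{
    $pattern = '#^(?:(?<type1>[0-9])|(?<type2>[a-x]{300})|(?<type3>[x-z]{1,5}))$#';
    if (preg_match($pattern, $input, $m) !== 1) {
        return null;
    }
    foreach (['type1', 'type2', 'type3'] as $name) {
        // Groups that did not take part in the match are empty or absent.
        if (($m[$name] ?? '') !== '') {
            return $name;
        }
    }
    return null;
}

var_dump(classify('7'));     // string(5) "type1"
var_dump(classify('hello')); // NULL
```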
Since the regexes are going to be changing, I don't think you can get a generic answer: both your regexes and the way you handle them will need to evolve. For now, if you're looking to optimize your script, test for known substrings with something like indexOf before evaluating a regex, to lighten the regex load.
For instance, if you have 4 strings:
asdfsdfkjslkdujflkj2lkjsdlkf2lkja
100010010100111010100101001001011
101032021309420940389579873987113
asdfkajhslkdjhflkjshdlfkjhalksjdf
Each belongs to a different "type" as you've described it, so you could do:
// type 1 only contains 0 or 1
// type 2 must have a "2"
// type 3 contains only letters
var arr = [
    "asdfsdfkjslkdujflkj2lkjsdlkf2lkja",
    "100010010100111010100101001001011",
    "101032021309420940389579873987113",
    "asdfkajhslkdjhflkjshdlfkjhalksjdf"
];
for (var i = 0; i < arr.length; i++) {
    var s = arr[i];
    // indexOf returns -1 when nothing is found; index 0 is a valid hit
    if (s.indexOf('2') >= 0) {
        // type 2
    } else if (s.indexOf('0') >= 0) {
        if (/^[01]+$/.test(s)) {
            // type 1
        } else {
            // ignore
        }
    } else if (/^[a-z]+$/i.test(s)) {
        // type 3
    }
}
See it in action here: http://jsfiddle.net/remus/44MdX/

How to find if two characters are in an array php

I am looking to develop a search function that allows users to just search for the item, or modify their search with a price range in brackets. So that is to say if they are looking for a car, then they can enter either car and receive all cars in the database or they can enter car (100, 299) or car(100, 299) and receive only cars in the database with the price range of 100 to 299.
What I did before was three different explode function calls, but that was cumbersome and looked ridiculously ugly. I also tried to put the brackets in an array and then compare that against the word searched (a word is basically an array of characters), but that didn't work. Finally, I have been reading up on strpos and substr, but they don't seem to fit the requirements, as strpos returns the first occurrence of the character and substr returns the characters within a specified length after a specific occurrence.
So for example the problem with strpos is the user can just enter ( and no ) bracket and I'll make a call to my search function with who knows what. And for example the problem with substr is that the price range can vary wildly.
You can use preg_match to parse the search string - I'm assuming that's the part you're having issues with.
if (preg_match('/car ?\(([^,]+), ?([^\)]+)\)/', $search_text, $matches)) {
$low_price = $matches[1];
$high_price = $matches[2];
//do your price filtering here
}
The regular expression may need a little tweaking, I don't remember offhand if parentheses need to be escaped in character classes.
Yes, Sam is right. You should do this with regular expressions.
Look up preg_match() in the documentation.
To complete his answer, the regular expression for your case is:
$regex = '/^([a-zA-Z]+)\s*\(([0-9]+),\s*([0-9]+)\)$/';
if (preg_match($regex, $search_text, $matches)) {
    $type = $matches[1];
    $low_price = $matches[2];
    $high_price = $matches[3];
    //do your price filtering here
}
Be careful: index 0 of the matches array holds the whole matched string, so the capture groups start at index 1.
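To cover both forms from the question (a bare term, or a term followed by a bracketed price range), the range can be made optional in the pattern. The sketch below assumes the term consists of letters only and the prices are whole numbers; the function name parseSearch and the returned array keys are invented for illustration.

```php
<?php
// Hypothetical parser: returns the term plus an optional price range,
// or null when the input is malformed (e.g. an unclosed bracket).
function parseSearch(string $search): ?array
{
    $pattern = '/^\s*([a-zA-Z]+)\s*(?:\(\s*(\d+)\s*,\s*(\d+)\s*\))?\s*$/';
    if (preg_match($pattern, $search, $m) !== 1) {
        return null;
    }
    // The range groups are absent from $m when no brackets were given.
    $hasRange = ($m[2] ?? '') !== '';
    return [
        'term' => $m[1],
        'min'  => $hasRange ? (int) $m[2] : null,
        'max'  => $hasRange ? (int) $m[3] : null,
    ];
}

var_dump(parseSearch('car (100, 299)'));
var_dump(parseSearch('car (100')); // NULL
```

Rejecting malformed input up front means the search function is never called with a half-parsed range.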

Regular expression to match an exact number of occurrence for a certain character

I'm trying to check if a string has a certain number of occurrence of a character.
Example:
$string = '123~456~789~000';
I want to verify if this string has exactly 3 instances of the character ~.
Is that possible using regular expressions?
Yes
/^[^~]*~[^~]*~[^~]*~[^~]*$/
Explanation:
^ ... $ means the whole string in many regex dialects
[^~]* a string of zero or more non-tilde characters
~ a tilde character
The string can have as many non-tilde characters as necessary, appearing anywhere in the string, but must have exactly three tildes, no more and no less.
As a single character is technically a substring, and the task is to count the number of its occurrences, I suppose the most efficient approach is to use a dedicated PHP function, substr_count:
$string = '123~456~789~000';
if (substr_count($string, '~') === 3) {
// string is valid
}
Obviously, this approach won't work if you need to count the number of pattern matches (for example, while you can count the number of '0' characters in your string with substr_count, you are better off using preg_match_all to count digits).
Yet for this specific question it should be faster overall, as substr_count is optimized for one specific goal (counting substrings), whereas preg_match_all is more general-purpose.
I believe this should work for a variable number of characters:
^(?:[^~]*~[^~]*){3}$
The advantage here is that you just replace 3 with however many you want to check.
To make it more efficient, it can be written as
^[^~]*(?:~[^~]*){3}$
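The pattern above can be built for any single character and count. This is a sketch only: the function name hasExactly is hypothetical, and it assumes $char is a single-byte character.

```php
<?php
// Hypothetical helper: true when $string contains $char exactly $count times.
function hasExactly(string $string, string $char, int $count): bool
{
    // Escape the character so that e.g. "." or "^" is taken literally.
    $c = preg_quote($char, '/');
    // The D modifier stops $ from matching before a trailing newline.
    $pattern = '/^[^' . $c . ']*(?:' . $c . '[^' . $c . ']*){' . $count . '}$/D';
    return preg_match($pattern, $string) === 1;
}

var_dump(hasExactly('123~456~789~000', '~', 3)); // bool(true)
var_dump(hasExactly('123~456~789~000', '~', 2)); // bool(false)
```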
This is what you are looking for:
EDIT based on comment below:
<?php
$string = '123~456~789~000';
$total = preg_match_all('/~/', $string);
echo $total; // Shows 3

Any faster, simpler alternative to php preg_match

I am using CakePHP 1.3 and I have a textarea where users submit articles. On submit, I want to scan the article for certain keywords and add the respective tags to the article.
I was thinking of preg_match, but the preg_match pattern has to be a string, so I would have to loop through a (big) array.
Is there an easier way to plug the keywords array into the pattern?
I appreciate all your help.
Thanks.
I suggest treating your array of keywords like a hash table. Lowercase the article text, explode by spaces, then loop through each word of the exploded array. If the word exists in your hash table, push it to a new array while keeping track of the number of times it's been seen.
I ran a quick benchmark comparing regex to hash tables in this scenario. To run it with regex 1000 times, it took 17 seconds. To run it with a hash table 1000 times, it took 0.4 seconds. It should be an O(n+m) process.
$keywords = array("computer", "dog", "sandwich");
$article = "This is a test using your computer when your dog is being a dog";
$arr = explode(" ", strtolower($article));
$tracker = array();
foreach($arr as $word){
if(in_array($word, $keywords)){
if(isset($tracker[$word]))
$tracker[$word]++;
else
$tracker[$word] = 1;
}
}
The $tracker array would output: "computer" => 1, "dog" => 2. You can then do the process to decide what tags to use. Or if you don't care about the number of times the keyword appears, you can skip the tracker part and add the tags as the keywords appear.
EDIT: The keyword array may need to be an inverted index array to ensure the fastest lookup. I am not sure how in_array() works, but if it searches, then this isn't as fast as it should be. An inverted index array would look like
array("computer" => 1, "dog" => 1, "sandwich" => 1); // "1" can be any value
Then you would do isset($keywords[$word]) to check if the word matches a keyword, instead of in_array(), which should give you O(1). Someone else may be able to clarify this for me though.
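The inverted-index lookup described above could be sketched like this. The function name countKeywords is made up for the example, and array_flip is used here as one way to build the keyword => value array; str_word_count replaces the plain explode so that punctuation does not stick to the words.

```php
<?php
// Hypothetical helper: counts how often each keyword occurs in the article.
function countKeywords(string $article, array $keywords): array
{
    // array_flip turns ["computer", "dog"] into ["computer" => 0, "dog" => 1],
    // so each membership test is a hash lookup via isset().
    $lookup = array_flip($keywords);

    $tracker = [];
    foreach (str_word_count(strtolower($article), 1) as $word) {
        if (isset($lookup[$word])) {
            $tracker[$word] = ($tracker[$word] ?? 0) + 1;
        }
    }
    return $tracker;
}

$keywords = ['computer', 'dog', 'sandwich'];
$article  = 'This is a test using your computer when your dog is being a dog';
print_r(countKeywords($article, $keywords)); // computer => 1, dog => 2
```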
If you don't need the power of regular expressions, you should just use strpos().
You will still need to loop through the array of words, but strpos is much, much faster than preg_match.
Of course, you could try matching all the keywords using one single regexp, like /word1|word2|word3/, but I'm not sure it is what you are looking for. And also I think it would be quite heavy and resource-consuming.
Instead, you can try a different approach, such as splitting the text into words and checking whether each word is interesting or not. I would make use of str_word_count(), using something like:
$text = 'this is my string containing some words, some of the words in this string are duplicated, some others are not.';
$words_freq = array_count_values(str_word_count($text, 1));
that splits the text into words and counts occurrences. The words are the keys of $words_freq, so you can then check with isset($words_freq[$keyword]) or array_intersect(array_keys($words_freq), $my_keywords).
If, as I guess, the case of the keywords doesn't matter to you, you can strtolower() the whole text before splitting it into words.
Of course, the only way to determine which approach is the best is to setup some testing, by running various search functions against some "representative" and quite long text and measuring the execution time and resource usage (try microtime(TRUE) and memory_get_peak_usage() to benchmark this).
EDIT: I cleaned up a bit the code and added a missing semi-colon :)
If you want to look for multiple words from an array, then combine said array into a regular expression:
$regex_array = implode("|", array_map("preg_quote", $array));
preg_match_all("/($regex_array)/", $src, $tags);
This converts your array into /(word|word|word|word|word|...)/. The array_map and preg_quote part is optional, only needed if the $array might contain special characters.
Avoid strpos and loops for this case. preg_match is faster when searching for alternatives.
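A slightly more defensive sketch of the same idea follows. It passes the delimiter to preg_quote so that a / inside a keyword is escaped too, matches case-insensitively, and only accepts whole words. The function name findTags and the sample keywords are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns the distinct keywords found in $text.
function findTags(string $text, array $keywords): array
{
    if ($keywords === []) {
        return []; // an empty alternation would match everywhere
    }
    $alternatives = implode('|', array_map(
        function (string $word): string {
            return preg_quote($word, '/');
        },
        $keywords
    ));
    // The lookarounds keep "dog" from matching inside "dogma".
    preg_match_all('/(?<!\w)(?:' . $alternatives . ')(?!\w)/i', $text, $matches);

    return array_values(array_unique(array_map('strtolower', $matches[0])));
}

print_r(findTags('My dog sat on the computer while I wrote C++.', ['computer', 'dog', 'c++']));
```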
strtr()
If given two arguments, the second should be an array in the form array('from' => 'to', ...). The return value is a string where all the occurrences of the array keys have been replaced by the corresponding values. The longest keys will be tried first. Once a substring has been replaced, its new value will not be searched again.
Add tags manually? Just like we add tags here at SO.
