Best way to parse this string and create an array from it - php

I have the following string:
{item1:test},{item2:hi},{another:please work}
What I want to do is turn it into an array that looks like this:
[item1] => test
[item2] => hi
[another] => please work
Here is the code I am currently using for that (which works):
$vf = '{item1:test},{item2:hi},{another:please work}';
$vf = ltrim($vf, '{');
$vf = rtrim($vf, '}');
$vf = explode('},{', $vf);
foreach ($vf as $vk => $vv)
{
    $ve = explode(':', $vv);
    $vx[$ve[0]] = $ve[1];
}
My concern is: what if the value has a colon in it? For example, let's say that the value for item1 is you:break. That colon is going to make me lose break entirely. What is a better way of coding this in case the value has a colon in it?

Why not set a limit on the explode function? Like this:
$ve = explode(':', $vv, 2);
This way the string will split only at the first occurrence of a colon.
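Put together with the loop from the question, this could look roughly like the sketch below. The function name parsePairs is hypothetical (it is not from the question), and the sketch assumes that keys never contain a colon and that values never contain the sequence },{.

```php
<?php
// Hypothetical helper: parse '{key:value},{key:value}' into an associative array.
// Assumes keys never contain ':' and values never contain '},{'.
function parsePairs(string $input): array
{
    $result = [];
    // Strip the outer braces, then split into 'key:value' chunks.
    $chunks = explode('},{', rtrim(ltrim($input, '{'), '}'));
    foreach ($chunks as $chunk) {
        // Limit of 2: split only at the first colon, keep the rest intact.
        $parts = explode(':', $chunk, 2);
        if (count($parts) === 2) {
            $result[$parts[0]] = $parts[1];
        }
    }
    return $result;
}

print_r(parsePairs('{item1:you:break},{item2:hi},{another:please work}'));
```

Chunks without any colon are skipped here; whether that is the right behaviour depends on your data.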

To address the possibility of the values having embedded colons, and for the sake of discussion (not necessarily performance):
$ve = explode(':', $vv);
$key = array_shift($ve);
$vx[$key] = implode(':', $ve);
...grabs the first element of the array, assuming the index will NOT have a colon in it. Then re-joins the rest of the array with colons.

Don't use effing explode for everything.
You can more reliably extract such simple formats with a trivial key:value regex. In particular since you have neat delimiters around them.
And it's far less code:
preg_match_all('/{(\w+):([^}]+)}/', $vf, $match);
$array = array_combine($match[1], $match[2]);
The \w+ just matches an alphanumeric string, and [^}]+ matches anything up to a closing }. And array_combine more easily turns it into a key=>value array.
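To check the colon case from the question, here is a small sketch using the same pattern (the braces are escaped here only for readability, and the sample input is made up):

```php
<?php
$vf = '{item1:you:break},{item2:hi},{another:please work}';

// \w+ captures the key; [^}]+ captures everything up to the closing brace,
// so colons inside the value are kept.
preg_match_all('/\{(\w+):([^}]+)\}/', $vf, $match);
$array = array_combine($match[1], $match[2]);

print_r($array);
```

A value that itself contains a } would still break this, since [^}]+ stops at the first closing brace.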

Answering your second question:
If your format breaks on specific content, it's bad. I think there are two ways to work around this.
Escape delimiters: every colon and curly bracket would have to be escaped, which is awkward, so instead the data is delimited with e.g. " and only those quotation marks are escaped (then you have JSON in this case).
Save data lengths: this is roughly how PHP serializes arrays. In that data structure you state that the next n chars are one token.
The first type is easy to read and manipulate, although one has to read the whole file to access it randomly.
The second type would be great for random access if the structure stored the number of bytes to skip rather than the number of characters (in UTF-8 you cannot skip n characters without reading them). PHP's serialize function produces n == strlen($token), so I don't know what its advantage over JSON is.
Where possible I try to use JSON for communication between different systems.
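As a rough illustration of the JSON route, assuming you control both the writing and the reading side (the sample data is made up), json_encode and json_decode take care of the escaping:

```php
<?php
// Values containing colons, braces and quotes survive the round trip,
// because JSON delimits strings with quotes and escapes only what it must.
$data = [
    'item1'   => 'you:break',
    'item2'   => 'hi {there}',
    'another' => 'please "work"',
];

$encoded = json_encode($data);
$decoded = json_decode($encoded, true); // true: decode objects as associative arrays

echo $encoded, "\n";
```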

Related

extract a part from an expression

I have some expressions of the form aa/bbbb/c/dd/ee. I want to select only the part dd from it using PHP. It can be done with substr, but the problem is that the length of bbbb can vary from 3 (i.e., bbb) to 4, the length of c can be 1 or 2, and the length of dd can be 2, 3 or 4. How can I extract the part dd (i.e., the part between the last pair of slashes)?
Use explode to split the string into an array and then grab the 4th item in the array, which will be dd regardless of the size of the other elements. Just make sure the number of '/' stays the same.
If the structure of the expression always has the "/" separators, even if the values in between vary in length (or are sometimes absent), you can use explode().
$parts_array = explode("/", $expression);
$dd = $parts_array[3];
If the number of slashes varies, you'll have to do more work, like determining how many slashes there are and what parts of the expression are missing. That's a fair bit more complex.
Here you go, the PHP script for selecting the text between the last pair of "/../":
<?php
$expression = "aaa/vvv/bbbb/cccc/ddd/ee";
$mystuff = explode("/", $expression);
echo $mystuff[sizeof($mystuff)-2];
?>
I hope it helps. Good luck!

PHP Regex to identify keys in array representation

I have this string authors[0][system:id] and I need a regex that returns:
array('authors', '0', 'system:id')
Any ideas?
Thanks.
Just use PHP's preg_split(), which returns an array of elements similarly to explode() but with RegEx.
Split the string on [ or ] and then remove the last element (which is an empty string) of the returned array, $tokens.
EDIT: Also, remove the 3rd element with array_splice($array, int $offset, int $length), since this item is also an empty string.
The regex /[\[\]]/ just means match any [ or ] character
$string = "authors[0][system:id]";
$tokens = preg_split("/[\]\[]/", $string);
array_pop($tokens);
array_splice($tokens, 2, 1);
//rest of your code using $tokens
Here is the format of $tokens after this has run:
Array ( [0] => authors [1] => 0 [2] => system:id )
Taking the most simplistic approach, we would just match the three individual parts. So first of all we'd look for the token that is not enclosed in brackets:
[a-z]+
Then we'd look for the brackets and the value in between:
\[[^\]]+\]
And then we'd repeat the second step.
You'd also need to add capture groups () to extract the actual values that you want.
So when you put it all together you get something like:
([a-z]+)\[([^\]]+)\]\[([^\]]+)\]
That expression could then be used with preg_match() and the values you want would be extracted into the referenced array passed to the third argument. But you'll notice the above expression is quite a difficult-to-read collection of punctuation, and also that the resulting array has an extra element on it that we don't want: preg_match() places the whole matched string into the first index of the output array. We're close, but it's not ideal.
However, as @AlienHoboken correctly points out and almost correctly implements, a simpler solution would be to split the string up based on the position of the brackets. First let's take a look at the expression we'd need (or at least, the one that I would use):
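For reference, a minimal sketch of that preg_match() approach, using the expression above unchanged; the array_shift() call is one way to drop the extra first element:

```php
<?php
$str = 'authors[0][system:id]';
$matches = [];

if (preg_match('/([a-z]+)\[([^\]]+)\]\[([^\]]+)\]/', $str, $matches)) {
    // Index 0 holds the whole matched string; remove it to keep only the groups.
    array_shift($matches);
}

print_r($matches);
```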
(?:\[|\])+
This looks for at least one occurrence of either [ or ] and uses that block as the delimiter for the split. This seems like exactly what we need, except when we run it we'll find we have a small issue:
array('authors', '0', 'system:id', '')
Where did that extra empty string come from? Well, the last character of the input string matches your delimiter expression, so it's treated as a split position, with the result that an empty string gets appended to the results.
This is quite a common issue when splitting based on a regular expression, and luckily PCRE knows this and provides a simple way to avoid it: the PREG_SPLIT_NO_EMPTY flag.
So when we do this:
$str = 'authors[0][system:id]';
$expr = '/(?:\[|\])+/';
$result = preg_split($expr, $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
...you will see the result you want.

How to convert this PHP RegEx match pattern to .Net

This is a pretty complex regular expression that returns an array of key/value pairs from a proprietary string of data. Here is a sample of the data, in case the expression cannot be used in .Net and another method needs to be used.
0,"101"1,"12345"11,"ABC Company"12,"John Doe"13,"123 Main St"14,""15,"Malvern"16,"PA"17,"19355"19,"UPS"21,"10"22,"GND"23,""24,"082310"25,""26,"0.00"29,"1Z1235550300000645"30," PA 193 9-05"34,"6.55"37,"6.55"38,"8.05"65,"1Z1235550300000645"77,"10"96,""97,""98
If you look closely you see it's key,"value",key,"value". The only guarantee on formatting is that each key/value pair is separated by a comma, and each value will always be encased in double quotes. The main problem (the reason you can't explode it) is the poor choice of the previous coder to separate keys and values with the same character as the entries. Anyway, that's out of my hands. Here is a working PHP example.
function parseResponse($response) {
    // split response into $key, $value pieces
    preg_match_all("/(.*?),\"(.*?)\"/", $response, $m);
    // loop through pieces and format
    foreach ($m[1] as $index => $key) {
        $value = $m[2][$index];
        echo $key . ":" . $value;
        // this will output KEY:VALUE for each entry in the string
    }
}
You can see the expression /(.*?),\"(.*?)\"/
Here is what I have in VB .Net
Imports System.Text.RegularExpressions
Public Class Parser
    Private Sub parseResponse(ByVal response As String)
        Dim regExMatch As Match = Regex.Match(response, "/(.*?),\""(.*?)\""/")
    End Sub
End Class
You need to remove the PHP delimiters:
Dim RegexObj As New Regex("(.*?),""(.*?)""")
Also, better be more specific about what can be matched (makes the regex much more efficient):
Dim RegexObj As New Regex("([^,]*),""([^""]*)""")
Now the first group only matches characters that aren't commas, and the second one only matches characters that aren't quotes. Both regexes would fail, by the way, if you were to have escaped quotes in your data.
To get all matches in a string, use
AllMatchResults = RegexObj.Matches(response)

Filter array of numeric PIN code strings which may be in the format "######" or "### ###"

I have a PHP array of strings. The strings are supposed to represent PIN codes which are of 6 digits like:
560095
Having a space after the first 3 digits is also considered valid e.g. 560 095.
Not all array elements are valid. I want to filter out all invalid PIN codes.
Yes you can make use of regex for this.
PHP has a function called preg_grep to which you pass your regular expression and it returns a new array with entries from the input array that match the pattern.
$new_array = preg_grep('/^\d{3} ?\d{3}$/',$array);
Explanation of the regex:
^ - Start anchor
\d{3} - 3 digits. Same as [0-9][0-9][0-9]
 ? - optional space (there is a space before ?). If you want to allow any number of any whitespace between the groups, you can use \s* instead
\d{3} - 3 digits
$ - End anchor
Yes, you can use a regular expression to make sure there are 6 digits with or without a space.
A neat tool for playing with regular expressions is RegExr... here's what RegEx I came up with:
^[0-9]{3}\s?[0-9]{3}$
It matches the beginning of the string ^, then any three numbers [0-9]{3} followed by an optional space \s? followed by another three numbers [0-9]{3}, followed by the end of the string $.
Passing the array into the PHP function preg_grep along with the regex will return a new array containing only the matching entries.
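A minimal sketch of that call, with made-up sample data. Note that preg_grep keeps the original keys, so array_values() is used here to reindex the result:

```php
<?php
$pins = ['560095', '560 095', '56009', '5600955', 'abc123', '560  095'];

// Keep only entries with 3 digits, an optional whitespace character, and 3 more digits.
$valid = preg_grep('/^[0-9]{3}\s?[0-9]{3}$/', $pins);

// preg_grep preserves the keys of the input array; reindex from 0.
$valid = array_values($valid);

print_r($valid);
```

Be aware that \s also matches tabs and newlines; use a literal space if only a plain space should count as valid.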
If you just want to iterate over the valid responses (loop over them), you could always use a RegexIterator:
$regex = '/^\d{3}\s?\d{3}$/';
$it = new RegexIterator(new ArrayIterator($array), $regex);
foreach ($it as $valid) {
    // Only matching items will be looped over, non-matching will be skipped
}
It has the benefit of not copying the entire array (it computes the next one when you want it). So it's much more memory efficient than doing something with preg_grep for large arrays. But it also will be slower if you iterate multiple times (but for a single iteration it should be faster due to the memory usage).
If you want to get an array of the valid PIN codes, use codaddict's answer.
You could also, at the same time as filtering only valid PINs, remove the optional space character so that all PINs become 6 digits by using preg_filter:
$new_array = preg_filter('/^(\d{3}) ?(\d{3})$/D', '$1$2', $array);
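As a quick check of what that returns, here is a sketch with made-up input. For array input, preg_filter returns only the entries that matched, with the replacement applied and the original keys kept:

```php
<?php
$array = ['560095', '560 095', '12345', '560095x'];

// Entries that do not match the pattern are dropped;
// matching entries have the optional space removed.
$normalized = preg_filter('/^(\d{3}) ?(\d{3})$/D', '$1$2', $array);

print_r($normalized);
```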
The best answer might depend on your situation, but if you wanted to do a simple and low cost check first...
$item = str_replace( " ", "", $var );
if ( strlen( $item ) !== 6 ){
    echo 'fail early';
}
Following that, you could equally go on and do some type checking, as long as valid numbers do not start with a 0, in which case it might be more difficult.
If you don't fail early, then go on with the regex solutions already posted.

Any faster, simpler alternative to php preg_match

I am using CakePHP 1.3 and I have a textarea where users submit articles. On submit, I want to look in the article for certain keywords and add the respective tags to the article.
I was thinking of preg_match, but the preg_match pattern has to be a string, so I would have to loop through a (big) array.
Is there an easier way to plug the keywords array into the pattern?
I appreciate all your help.
Thanks.
I suggest treating your array of keywords like a hash table. Lowercase the article text, explode by spaces, then loop through each word of the exploded array. If the word exists in your hash table, push it to a new array while keeping track of the number of times it's been seen.
I ran a quick benchmark comparing regex to hash tables in this scenario. To run it with regex 1000 times, it took 17 seconds. To run it with a hash table 1000 times, it took 0.4 seconds. It should be an O(n+m) process.
$keywords = array("computer", "dog", "sandwich");
$article = "This is a test using your computer when your dog is being a dog";
$arr = explode(" ", strtolower($article));
$tracker = array();
foreach($arr as $word){
    if(in_array($word, $keywords)){
        if(isset($tracker[$word]))
            $tracker[$word]++;
        else
            $tracker[$word] = 1;
    }
}
The $tracker array would output: "computer" => 1, "dog" => 2. You can then do the process to decide what tags to use. Or if you don't care about the number of times the keyword appears, you can skip the tracker part and add the tags as the keywords appear.
EDIT: The keyword array may need to be an inverted index array to ensure the fastest lookup. I am not sure how in_array() works, but if it searches, then this isn't as fast as it should be. An inverted index array would look like
array("computer" => 1, "dog" => 1, "sandwich" => 1); // "1" can be any value
Then you would do isset($keywords[$word]) to check if the word matches a keyword, instead of in_array(), which should give you O(1). Someone else may be able to clarify this for me though.
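A sketch of that variant, assuming the keyword list starts out as a plain list; array_flip() is one way to build the inverted index (the variable name $lookup is made up):

```php
<?php
$keywords = ['computer', 'dog', 'sandwich'];
// Turn the list into keyword => position, so lookups go through the array keys.
$lookup = array_flip($keywords);

$article = 'This is a test using your computer when your dog is being a dog';
$tracker = [];

foreach (explode(' ', strtolower($article)) as $word) {
    // isset() on a key is a hash lookup rather than a scan of the whole list.
    if (isset($lookup[$word])) {
        $tracker[$word] = ($tracker[$word] ?? 0) + 1;
    }
}

print_r($tracker);
```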
If you don't need the power of regular expressions, you should just use strpos().
You will still need to loop through the array of words, but strpos is much, much faster than preg_match.
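Roughly like this, as a sketch with made-up data. stripos() is the case-insensitive variant of strpos(); keep in mind that both find substrings, so "dog" would also match inside "dogma":

```php
<?php
$keywords = ['computer', 'dog', 'sandwich'];
$article = 'Your computer and your dog are friends';
$tags = [];

foreach ($keywords as $keyword) {
    // Compare strictly against false: a match at position 0 is still a match.
    if (stripos($article, $keyword) !== false) {
        $tags[] = $keyword;
    }
}

print_r($tags);
```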
Of course, you could try matching all the keywords using one single regexp, like /word1|word2|word3/, but I'm not sure it is what you are looking for. And also I think it would be quite heavy and resource-consuming.
Instead, you can try a different approach, such as splitting the text into words and checking whether the words are interesting or not. I would make use of str_word_count() using something like:
$text = 'this is my string containing some words, some of the words in this string are duplicated, some others are not.';
$words_freq = array_count_values(str_word_count($text, 1));
that splits the text into words and counts occurrences. Since the words are the keys of $words_freq, you can then check with isset($words_freq[$keyword]) or array_intersect(array_keys($words_freq), $my_keywords).
If, as I guess, you are not interested in the case of the keywords, you can strtolower() the whole text before splitting it into words.
Of course, the only way to determine which approach is the best is to setup some testing, by running various search functions against some "representative" and quite long text and measuring the execution time and resource usage (try microtime(TRUE) and memory_get_peak_usage() to benchmark this).
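A very rough timing harness along those lines might look like this; the sample text is synthetic and the numbers will vary by machine, so treat it as a sketch rather than a proper benchmark:

```php
<?php
// Synthetic "long" text: the same sentence repeated 1000 times.
$text = str_repeat('this is my string containing some words ', 1000);

$start = microtime(true);
$words_freq = array_count_values(str_word_count(strtolower($text), 1));
$elapsed = microtime(true) - $start;

printf("elapsed: %.6f s, peak memory: %d bytes\n", $elapsed, memory_get_peak_usage());
```

Swap the middle line for each approach you want to compare and run it several times before drawing conclusions.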
EDIT: I cleaned up a bit the code and added a missing semi-colon :)
If you want to look for multiple words from an array, then combine said array into a regular expression:
$regex_array = implode("|", array_map("preg_quote", $array));
preg_match_all("/($regex_array)/", $src, $tags);
This converts your array into /(word|word|word|word|word|...)/. The array_map and preg_quote part is optional, only needed if the $array might contain special characters.
Avoid strpos and loops for this case. preg_match is faster when searching for alternatives.
strtr()
If given two arguments, the second should be an array in the form array('from' => 'to', ...). The return value is a string where all the occurrences of the array keys have been replaced by the corresponding values. The longest keys will be tried first. Once a substring has been replaced, its new value will not be searched again.
Add tags manually? Just like we add tags here at SO.
