Extracting specific parts of a predictably structured string with unpredictable contents - php

Ok, I have a complex problem for you guys.
I am trying to extract some values from a load of old data. It's a bunch of strings which are basically 7 parts concatenated with ||
test1||keep||1:1||test||3462||7885||test
Rules
Each section of the string could have any character in it, except | or two arrows like this <> (see further down) which are reserved as separators.
Any of the sections could be empty.
E.g. in this one the 1st, 5th and 6th sections are empty, and the 3rd contains lots of non-alphanumeric characters.
||keep||test's\ (o-kay?).go_od||test||||||test
Furthermore...
Some of the strings are made up of multiple ones of these 7 pieces, further separated with <>
test1||keep||1:1||test||3462||7885||test<>test1||keep||1:1||test||3462||7885||test<>test1||keep||1:1||test||3462||7885||test
Remember, any of the inner sections could be empty.
test54||keep||test's\ (o-kay?).go_od||test||||||<>test||keep||test545's'/.||test||||test||test
The Goal
Extract just the second part of every string, and put into an array. In my examples above, it is every part which has the word keep inside.
So for this example:
||keep||test's\ (o-kay?).go_od||test||||||test
I want to get:
array('keep')
And for this example:
test1||keep-me||1:1||test||3462||7885||test<>||keep||||||3462||7885||<>test1||keep-me-too!||1:1||test||3462||||test
It can be seen as 3 different strings which are separated by <>:
test1||keep-me||1:1||test||3462||7885||test
||keep||||||3462||7885||
test1||keep-me-too!||1:1||test||3462||||test
And I want to extract:
array('keep-me', 'keep', 'keep-me-too!')
Notes
I have tried doing this with preg_match but look-behind doesn't like searching for non-fixed length strings.
I cannot change the data. It is old data I just have to work with.

$array = [];
$strings = explode('<>', $yourContent);
foreach ($strings as $string) {
    $array[] = explode('||', $string)[1];
}
This uses array dereferencing introduced in PHP 5.4.
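The loop above assumes every record contains at least two sections. A minimal sketch that wraps the same idea in a function and skips records that have no second section; the function name extractSecondParts is hypothetical, not something from the question:

```php
<?php
// Hypothetical helper: collect the second "||" section of every "<>"-separated record.
function extractSecondParts(string $content): array
{
    $result = [];
    foreach (explode('<>', $content) as $record) {
        $sections = explode('||', $record);
        // Guard against records with fewer than two sections.
        if (isset($sections[1])) {
            $result[] = $sections[1];
        }
    }
    return $result;
}

$data = "test1||keep-me||1:1||test||3462||7885||test<>||keep||||||3462||7885||<>test1||keep-me-too!||1:1||test||3462||||test";
print_r(extractSecondParts($data));
```

No regular expression is needed here, because | and <> never appear inside a section.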

Related

How to count related word from a list of word separated with comma

Hello, I am trying to count the number of related words in a list of words separated by commas. In short, I am trying to work with tags, and I want to take the first word and count the words similar to it. For example:
$words = 'php, codes, php, script, php, chat.';
In the above example the word php appears three times. Is there any possible way to catch the words separately and output this:
php - 3
codes - 1
script - 1
Meanwhile, I was able to loop over the words from my DB and output each of them as a link, but I also want to use them as tag references to count the words found for each tag.
Here is one way that works with PHP 7.2+:
<?php
$words = rtrim('php, codes, php, script, php, chat.',".");
$exploded = explode(", ",$words);
$arrCount = array_count_values($exploded);
var_dump($arrCount);
live code: https://3v4l.org/BNXjf
The code trims off the period punctuating the string. Then it explodes the string into the array $exploded, using a 2-character delimiter composed of a comma followed by a space. Finally it uses the built-in function array_count_values() (not specific to PHP 7; it has been available since PHP 4), which creates a new array, here $arrCount. Its keys are the values of $exploded and its values are the number of times each value appears in $exploded, as the var_dump() of $arrCount reveals.
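If the tags are not spaced or cased consistently, exploding on ", " alone will miss some of them. A minimal sketch, assuming that tags should be compared case-insensitively and that surrounding whitespace and periods are noise (the sample string is made up for illustration):

```php
<?php
$words = 'php, codes, PHP,script, php, chat.';

// Split on commas, then strip whitespace and periods and lowercase each tag.
$tags = array_map(
    function ($tag) {
        return strtolower(trim($tag, " \t\n\r."));
    },
    explode(',', $words)
);

$counts = array_count_values($tags);
arsort($counts); // most frequent tag first

foreach ($counts as $tag => $count) {
    echo "$tag - $count\n";
}
```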

extract a part from an expression

I have some expressions of the form aa/bbbb/c/dd/ee. I want to select only the part dd from it using a php code. Using 'substr' it can be done, but the problem is that the lengths of bbbb can vary from 3 (i.e., bbb) to 4, lengths of c can be 1 or 2 and the lengths of dd can be 2, 3 or 4. Then how can I extract the part dd (i.e, the part between the last pair / /).
Use explode() to split the string into an array and then grab the 4th item (index 3), which will be dd regardless of the size of the other elements. Just make sure the number of '/' stays the same.
If the structure of the expression always has the "/" separators even if the values in between are varying in length (or sometimes absent) you can use explode().
$parts_array = explode("/", $expression);
$dd = $parts_array[3];
If the number of slashes varies, you'll have to do more work, like determining how many slashes there are and what parts of the expression are missing. That's a fair bit more complex.
Here you go, the PHP script for selecting the text between the last pair of "/../":
<?php
$expression = "aaa/vvv/bbbb/cccc/ddd/ee";
$mystuff = explode("/", $expression);
echo $mystuff[sizeof($mystuff)-2];
?>
I hope it helps. Good luck!
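If an expression might contain fewer slashes than expected, the index arithmetic above can be guarded. A minimal sketch; the function name secondToLastSegment is hypothetical:

```php
<?php
// Hypothetical helper: return the segment between the last two "/" separators,
// or null when the expression contains fewer than two slashes.
function secondToLastSegment(string $expression): ?string
{
    $parts = explode('/', $expression);
    if (count($parts) < 3) {
        return null;
    }
    return $parts[count($parts) - 2];
}

echo secondToLastSegment('aa/bbbb/c/dd/ee'), "\n";
```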

PHP Regex to identify keys in array representation

I have this string authors[0][system:id] and I need a regex that returns:
array('authors', '0', 'system:id')
Any ideas?
Thanks.
Just use PHP's preg_split(), which returns an array of elements similarly to explode() but with RegEx.
Split the string on [ or ] and then remove the last element (which is an empty string) of the resulting array, $tokens.
EDIT: Also, remove the 3rd element with array_splice(array &$array, int $offset, int $length), since this item is also an empty string.
The regex /[\[\]]/ just means match any [ or ] character
$string = "authors[0][system:id]";
$tokens = preg_split("/[\]\[]/", $string);
array_pop($tokens);
array_splice($tokens, 2, 1);
//rest of your code using $tokens
Here is the format of $tokens after this has run:
Array ( [0] => authors [1] => 0 [2] => system:id )
Taking the most simplistic approach, we would just match the three individual parts. So first of all we'd look for the token that is not enclosed in brackets:
[a-z]+
Then we'd look for the brackets and the value in between:
\[[^\]]+\]
And then we'd repeat the second step.
You'd also need to add capture groups () to extract the actual values that you want.
So when you put it all together you get something like:
([a-z]+)\[([^\]]+)\]\[([^\]]+)\]
That expression could then be used with preg_match() and the values you want would be extracted into the referenced array passed to the third argument. But you'll notice the above expression is quite a difficult-to-read collection of punctuation, and also that the resulting array has an extra element on it that we don't want: preg_match() places the whole matched string into the first index of the output array. We're close, but it's not ideal.
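The combined expression can be tried with preg_match() roughly as follows. The ^ and $ anchors are an addition in this sketch (an assumption that the whole string must match), and array_slice() is one way to drop the unwanted first element:

```php
<?php
$str = 'authors[0][system:id]';
$expr = '/^([a-z]+)\[([^\]]+)\]\[([^\]]+)\]$/';

$keys = [];
if (preg_match($expr, $str, $matches)) {
    // $matches[0] holds the whole matched string; keep only the capture groups.
    $keys = array_slice($matches, 1);
}
print_r($keys);
```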
However, as @AlienHoboken correctly points out and almost correctly implements, a simpler solution would be to split the string up based on the position of the brackets. First let's take a look at the expression we'd need (or at least, the one that I would use):
(?:\[|\])+
This looks for at least one occurrence of either [ or ] and uses that block as the delimiter for the split. This seems like exactly what we need, except when we run it we'll find we have a small issue:
array('authors', '0', 'system:id', '')
Where did that extra empty string come from? Well, the last character of the input string matches your delimiter expression, so it's treated as a split position, with the result that an empty string gets appended to the results.
This is quite a common issue when splitting based on a regular expression, and luckily preg_split() provides a simple way to avoid it: the PREG_SPLIT_NO_EMPTY flag.
So when we do this:
$str = 'authors[0][system:id]';
$expr = '/(?:\[|\])+/';
$result = preg_split($expr, $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
...you will see the result you want.

PHP Extract Similar Parts from Multiple Strings

I'm trying to extract the parts which are similar from multiple strings.
The purpose of this is an attempt to extract the title of a book from multiple OCRings of the title page.
This applies to only the beginning of the string, the ends of the strings don't need to be trimmed and can stay as they are.
For example, my strings might be:
$title[0]='the history of the internet, expanded and revised';
$title[1]='the history of the internet';
$title[2]='published by xyz publisher the historv of the internot, expanded and';
$title[3]='history of the internet';
So basically I would want to trim each string so that it starts at the most probable starting point. Considering that there may be OCR errors (e.g. "historv", "internot") I thought it might be best to take the number of characters from each word, which would give me an array for each string (so a multi-dimensional array) with the length of each word. This can then be used to find running matches and trim the beginning of each string to the most likely starting point.
The strings should be cut to:
$title[0]='the history of the internet, expanded and revised';
$title[1]='the history of the internet';
$title[2]='the historv of the internot, expanded and';
$title[3]='XXX history of the internet';
So I need to be able to recognize that "history of the internet" (7 2 3 8) is the run which matches all strings, and that the preceding "the" is most probably correct seeing as it occurs in >50% of the strings, and therefore the beginning of each string is trimmed to "the" and a placeholder of the same length is added onto the string missing "the".
So far I have got:
function CompareSimilarStrings($array)
{
    $n = count($array);
    // Get length of each word in each string
    for ($run = 0; $run < $n; $run++)
    {
        $temp = explode(' ', $array[$run]);
        foreach ($temp as $key => $val)
            $len[$run][$key] = strlen($val);
    }
    for ($run = 0; $run < $n; $run++)
    {
    }
}
As you can see, I'm stuck on finding the running matches.
Any ideas?
You should look into the Smith-Waterman algorithm for local sequence alignment. It is a dynamic programming algorithm which finds the regions of two strings that are most similar, i.e. the substrings that can be aligned with the fewest edits.
So if you want to try it out, here is a php implementation of the algorithm.
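To give a feel for the technique, here is a minimal sketch of Smith-Waterman applied to words rather than characters. It is not the implementation linked above. The function names, the scores (+2 match, -1 mismatch, -1 gap) and the rule that two words "match" when their Levenshtein distance is at most 1 are all assumptions made for illustration:

```php
<?php
// Hypothetical helper: lowercase the text and split it into words, dropping punctuation.
function toWords(string $s): array
{
    return preg_split('/\W+/', strtolower($s), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical sketch of word-level Smith-Waterman local alignment.
function localAlign(array $a, array $b): array
{
    $match = 2; $mismatch = -1; $gap = -1; // illustrative scores
    $n = count($a); $m = count($b);
    $best = 0; $endA = 0; $endB = 0;
    // $h[$i][$j] = best score of a local alignment ending at $a[$i-1] and $b[$j-1].
    $h = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));

    for ($i = 1; $i <= $n; $i++) {
        for ($j = 1; $j <= $m; $j++) {
            // Tolerate one-character OCR errors such as "historv".
            $similar = levenshtein($a[$i - 1], $b[$j - 1]) <= 1;
            $h[$i][$j] = max(
                0,
                $h[$i - 1][$j - 1] + ($similar ? $match : $mismatch),
                $h[$i - 1][$j] + $gap,
                $h[$i][$j - 1] + $gap
            );
            if ($h[$i][$j] > $best) {
                $best = $h[$i][$j];
                $endA = $i;
                $endB = $j;
            }
        }
    }

    // Trace back from the best cell until the score drops to zero.
    $i = $endA; $j = $endB;
    while ($i > 0 && $j > 0 && $h[$i][$j] > 0) {
        $similar = levenshtein($a[$i - 1], $b[$j - 1]) <= 1;
        if ($h[$i][$j] === $h[$i - 1][$j - 1] + ($similar ? $match : $mismatch)) {
            $i--; $j--;
        } elseif ($h[$i][$j] === $h[$i - 1][$j] + $gap) {
            $i--;
        } else {
            $j--;
        }
    }
    return ['score' => $best, 'startA' => $i, 'startB' => $j, 'endA' => $endA, 'endB' => $endB];
}

$x = toWords('the history of the internet');
$y = toWords('published by xyz publisher the historv of the internot, expanded and');
$r = localAlign($x, $y);
// Print the words of $y that take part in the best local alignment.
echo implode(' ', array_slice($y, $r['startB'], $r['endB'] - $r['startB'])), "\n";
```

This only aligns one pair of titles. Deciding on a common starting point across all titles (for example by majority vote over the pairwise alignments) would still have to be built on top of it.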

Any faster, simpler alternative to php preg_match

I am using CakePHP 1.3 and I have a textarea where users submit articles. On submit, I want to look through the article for certain keywords and add the respective tags to the article.
I was thinking of preg_match, but the preg_match pattern has to be a string, so I would have to loop through a (big) array.
Is there an easier way to plug the keywords array into the pattern?
I appreciate all your help.
Thanks.
I suggest treating your array of keywords like a hash table. Lowercase the article text, explode by spaces, then loop through each word of the exploded array. If the word exists in your hash table, push it to a new array while keeping track of the number of times it's been seen.
I ran a quick benchmark comparing regex to hash tables in this scenario. To run it with regex 1000 times, it took 17 seconds. To run it with a hash table 1000 times, it took 0.4 seconds. It should be an O(n+m) process.
$keywords = array("computer", "dog", "sandwich");
$article = "This is a test using your computer when your dog is being a dog";
$arr = explode(" ", strtolower($article));
$tracker = array();
foreach($arr as $word){
    if(in_array($word, $keywords)){
        if(isset($tracker[$word]))
            $tracker[$word]++;
        else
            $tracker[$word] = 1;
    }
}
The $tracker array would output: "computer" => 1, "dog" => 2. You can then do the process to decide what tags to use. Or if you don't care about the number of times the keyword appears, you can skip the tracker part and add the tags as the keywords appear.
EDIT: The keyword array should be an inverted index array to ensure the fastest lookup. in_array() performs a linear search over the values, so with a big keyword list the loop above is slower than it needs to be. An inverted index array would look like
array("computer" => 1, "dog" => 1, "sandwich" => 1); // "1" can be any value
Then you would do isset($keywords[$word]) to check if the word matches a keyword, instead of in_array(). That is a hash lookup, which is O(1) on average.
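The inverted index does not have to be written by hand. A minimal sketch using array_flip() on the keyword list from the example above (the variable name $lookup is an assumption):

```php
<?php
$keywords = array("computer", "dog", "sandwich");
$article = "This is a test using your computer when your dog is being a dog";

// array_flip() turns the values into keys, so isset() becomes a hash lookup.
$lookup = array_flip($keywords);

$tracker = array();
foreach (explode(" ", strtolower($article)) as $word) {
    if (isset($lookup[$word])) {
        $tracker[$word] = ($tracker[$word] ?? 0) + 1;
    }
}
print_r($tracker);
```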
If you don't need the power of regular expressions, you should just use strpos().
You will still need to loop through the array of words, but strpos is much, much faster than preg_match.
Of course, you could try matching all the keywords using one single regexp, like /word1|word2|word3/, but I'm not sure it is what you are looking for. And also I think it would be quite heavy and resource-consuming.
Instead, you can try a different approach, such as splitting the text into words and checking whether the words are interesting or not. I would make use of str_word_count(), using something like:
$text = 'this is my string containing some words, some of the words in this string are duplicated, some others are not.';
$words_freq = array_count_values(str_word_count($text, 1));
This splits the text into words and counts occurrences. Then you can check with isset($words_freq[$keyword]) (the words are the array keys, so in_array() would search the counts instead) or array_intersect(array_keys($words_freq), $my_keywords).
If, as I guess, you don't care about the case of the keywords, you can strtolower() the whole text before splitting it into words.
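Putting those pieces together, a minimal sketch. The keyword list and the shortened sample text are made up for illustration, and array_intersect_key() is used here instead of array_intersect() so that the counts are kept:

```php
<?php
$text = 'This is my string containing some words, some of the WORDS in this string are duplicated.';
$my_keywords = array('words', 'string', 'sandwich');

// Lowercase first so that "WORDS" and "words" are counted together.
$words_freq = array_count_values(str_word_count(strtolower($text), 1));

// Keep only the counted words that are also keywords.
$found = array_intersect_key($words_freq, array_flip($my_keywords));
print_r($found);
```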
Of course, the only way to determine which approach is best is to set up some tests, running the various search functions against some "representative" and fairly long text and measuring the execution time and resource usage (try microtime(TRUE) and memory_get_peak_usage() to benchmark this).
EDIT: I cleaned up a bit the code and added a missing semi-colon :)
If you want to look for multiple words from an array, then combine said array into a regular expression:
$regex_array = implode("|", array_map("preg_quote", $array));
preg_match_all("/($regex_array)/", $src, $tags);
This converts your array into /(word|word|word|word|word|...)/. The array_map and preg_quote part is optional, only needed if the $array might contain special characters. Note that preg_quote() does not escape the / delimiter unless it is passed as the second argument.
Avoid strpos and loops for this case. preg_match is faster when searching for alternatives.
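A slightly fuller sketch of the single-regex approach, assuming case-insensitive matching is wanted. PHP's escaping function is preg_quote(); the keyword list and text are made up for illustration:

```php
<?php
$keywords = array('computer', 'dog', 'c++');
$src = 'Your dog chewed the computer while I was learning C++, bad dog.';

// Escape each keyword, passing "/" as the delimiter so that it is escaped too.
$escaped = array_map(function ($word) {
    return preg_quote($word, '/');
}, $keywords);

// The "i" modifier makes the match case-insensitive.
$regex = '/(' . implode('|', $escaped) . ')/i';
preg_match_all($regex, $src, $tags);

// $tags[1] holds the captured keywords in order of appearance.
$found = array_unique(array_map('strtolower', $tags[1]));
print_r($found);
```

As written, the pattern also matches keywords inside longer words (for example "dog" in "dogma"). Whether that is acceptable depends on the tagging rules.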
strtr()
If given two arguments, the second should be an array in the form array('from' => 'to', ...). The return value is a string where all the occurrences of the array keys have been replaced by the corresponding values. The longest keys will be tried first. Once a substring has been replaced, its new value will not be searched again.
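A small sketch of the two rules quoted above (longest key first, no re-scanning of replaced text); the replacement pairs are made up for illustration:

```php
<?php
$replacements = array(
    'php'        => 'PHP',
    'php script' => 'PHP program', // longer key, tried before 'php'
    'PHP'        => 'Perl',        // never applied to text that was just inserted
);

$out = strtr('a php script in php', $replacements);
echo $out, "\n"; // a PHP program in PHP
```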
Add tags manually? Just like we add tags here at SO.
