regular expression php to parse a file


I want to parse a file and store it in an array in PHP. However, there are some rules which should be observed:
(p="value") should be ignored, but the "value" should be preserved.
- should be ignored.
whitespaces should be ignored.
split by \t and \n.
A sample string is :
NPD4196-2a_5_0
Geldanamycin - 0.166516 (p = 0.0068) Alamethicin - 0.158302 (p = 0.0206) 4-Hydroxytamoxifen - 0.1429 (p = 0.0183) Abietic acid - 0.133045 (p = 0.0203) Caspofungin - 0.130885 (p = 0.0432) Extract 00-303C - 0.12858 (p = 0.0356) U73122 - 0.113274 (p = 0.0482) Radicicol - 0.10213 (p = 0.0356) Calcium ionophore - 0.096183 (p = 0.0262)
So, the goal is to produce a data structure like:
Array('NPD4196-2a_5_0' => Array(Array( 0 => 'Geldanamycin', 1 => '0.166516', 2 => '0.0068'), Array( ... ));
I have this written so far ...
while(($line = fgets($fp)) !== false){
$args = preg_split( '/[\t\n (=) ]+/', $line, -1, PREG_SPLIT_NO_EMPTY );
if(count($args)){
print_r($args);
print "\n";
}
}
What am I missing in order to accomplish my goal?
Thanks

(.+?)-\s*([\d\.]+)\s*\(p\s*=\s*([\d\.]+)\)
That will grab the element (e.g. Geldanamycin) in group 1, the related value in group 2, and the p value in group 3.

This seems to work for one key-value pair (assuming NPD4196-2a_5_0 is the key in your example, and the second line is the value).
<?php
$fp = fopen('foo.txt', 'r');
$regex = '/(\w*)\s*-\s*([\d\.]+)\s*\(p\s*=\s*([\d\.]+)\)/';
$id = "NO ID";
$result = Array();
while(($line = fgets($fp)) !== false){
if (!preg_match($regex, $line)) {
$id = chop($line);
} else {
$all = Array();
while (preg_match($regex, $line, $matches, PREG_OFFSET_CAPTURE)) {
$last = end($matches);
$line = substr($line, $last[1] + strlen($last[0]) + 1);
$strings = Array();
for ($i = 1; $i < 4; $i++) {
array_push($strings, $matches[$i][0]);
}
array_push($all, $strings);
}
$result[$id] = $all;
}
}
print_r($result);
?>
(That is a slightly edited version of David B's regex.)
If a line doesn't match the pattern, it is stored as the ID. Otherwise, the inner while loop matches one entry per iteration and then chops the matched part off the front of the line. Because PREG_OFFSET_CAPTURE returns each match as a [string, offset] pair, the for loop copies only the strings into the result. Note that (\w*) matches word characters only, so names containing hyphens or spaces are truncated: 4-Hydroxytamoxifen comes out as Hydroxytamoxifen in the output below.
This prints:
Array
(
[NPD4196-2a_5_0] => Array
(
[0] => Array
(
[0] => Geldanamycin
[1] => 0.166516
[2] => 0.0068
)
[1] => Array
(
[0] => Alamethicin
[1] => 0.158302
[2] => 0.0206
)
[2] => Array
(
[0] => Hydroxytamoxifen
[1] => 0.1429
[2] => 0.0183
)
...
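As a possible alternative, here is a minimal sketch that collects every entry on a line with a single preg_match_all call and keeps names that contain hyphens or spaces. The function name parseLines, the choice to take an array of lines instead of a file handle, and the requirement of whitespace around the separating " - " are assumptions made for this example, not part of the answers above.

```php
<?php
// Hypothetical helper: groups "name - value (p = pvalue)" entries under the
// most recent line that contains no such entry (treated as the ID).
function parseLines(array $lines): array
{
    // Name is matched lazily up to a " - " that is followed by a number.
    $regex = '/\s*(.+?)\s+-\s+([\d.]+)\s*\(p\s*=\s*([\d.]+)\)/';
    $result = [];
    $id = 'NO ID';
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        if (preg_match_all($regex, $line, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $m) {
                $result[$id][] = [$m[1], $m[2], $m[3]];
            }
        } else {
            $id = $line; // no entries on this line, so treat it as an ID
        }
    }
    return $result;
}

$parsed = parseLines([
    "NPD4196-2a_5_0\n",
    "Geldanamycin - 0.166516 (p = 0.0068) 4-Hydroxytamoxifen - 0.1429 (p = 0.0183) Abietic acid - 0.133045 (p = 0.0203)\n",
]);
print_r($parsed);
```

With a real file, the lines could come from file('foo.txt') instead of a literal array.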

Related

PHP foreach() loop

$string = "The complete archive of The New York Times can now be searched from NYTimes.com "; // the actual input is unknown; it would be read from a textarea
$size = the longest word length from the string
I declared and initialized the arrays in a for loop (array1, array2, ... arrayN). Here is how I did it:
for ($i = 1; $i <= $size; $i++) {
${"array" . $i} = array();
}
so the words of $string would be divided up by word length:
$array1 = [""];
$array2 = ["of", "be", ...]
$array3 = ["the", "can", "now", ...] and so on
So my question is: how can I assign the words of $string to $array1, $array2, $array3, ... in a simple for or foreach loop, given that the input text and the length of its longest word are unknown?

I'd probably start with $words = explode(' ', $string)
then sort the words by length:
usort($words, function($word1, $word2) {
if (strlen($word1) == strlen($word2)) {
return 0;
}
return (strlen($word1) < strlen($word2)) ? -1 : 1;
});
$longestWordSize = strlen(end($words)); // end() returns the last element; PHP has no last() function
Loop over the words and place in their respective buckets.
Rather than separate variables for each length array, you should consider something like
$sortedWords = array(
1 => array('a', 'I'),
2 => array('to', 'be', 'or', 'is'),
3 => array('not', 'the'),
);
by looping over the words you don't need to know the maximum word length.
The final solution is as simple as
foreach ($words as $word) {
$wordLength = strlen($word);
$sortedWords[ $wordLength ][] = $word;
}

You could use something like this:
$words = explode(" ", $string);
foreach ($words as $w) {
array_push(${"array" . strlen($w)}, $w);
}
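This splits $string into an array of $words, then pushes each word onto the array that matches its length. It relies on the $array1 ... $arrayN variables having been initialized as in the question; an empty string (for example, one produced by the trailing space) has length 0, and $array0 is not among them.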

you can use explode().
$string = "The complete archive of The New York Times can now be searched from NYTimes.com " ;
$arr=explode(" ",$string);
$count=count($arr);
$big=0;
for ($i = 0; $i < $count; $i++) {
$p=strlen($arr[$i]);
if($big<$p){ $big_val=$arr[$i]; $big=$p;}
}
echo $big_val;

Just use the word length as the index and append [] each word:
foreach(explode(' ', $string) as $word) {
$array[strlen($word)][] = $word;
}
To remove duplicates $array = array_map('array_unique', $array);.
Yields:
Array
(
[3] => Array
(
[0] => The
[2] => New
[3] => can
[4] => now
)
[8] => Array
(
[0] => complete
[1] => searched
)
[7] => Array
(
[0] => archive
)
[2] => Array
(
[0] => of
[1] => be
)
[4] => Array
(
[0] => York
)
[5] => Array
(
[0] => Times
)
)
If you want to re-index the main array use array_values() and to re-index the subarrays use array_map() with array_values().
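The grouping, de-duplication and re-indexing steps described above can be combined as in the following sketch. Splitting with preg_split and PREG_SPLIT_NO_EMPTY (so the trailing space does not produce an empty word) and the variable name $byLength are choices made for this example only.

```php
<?php
$string = "The complete archive of The New York Times can now be searched from NYTimes.com ";

// Split on runs of whitespace and drop empty pieces.
$words = preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);

// Bucket each word under its length.
$byLength = [];
foreach ($words as $word) {
    $byLength[strlen($word)][] = $word;
}

// Remove duplicates inside each bucket, then re-index each bucket from 0.
$byLength = array_map('array_values', array_map('array_unique', $byLength));

// Order the buckets by word length while keeping the lengths as keys.
ksort($byLength);

print_r($byLength);
```

Here the outer keys are kept because they carry the word length; array_values($byLength) would discard them.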

PHP replace symbols all possible variants

I have an array of symbols that I want to replace, but I need to generate every possible combination:
$lt = array(
'a' => 'ą',
'e' => 'ę',
'i' => 'į',
);
For example if I have this string:
tazeki
There can be a huge number of results:
tązeki
tazęki
tązęki
tazekį
tązekį
tazękį
tązękį
My question is: what formula can I use to get all the variants?

Here is a solution particularly for your task. You can pass any word and any array for replacements, it should work.
<?php
function getCombinations($word, $charsReplace)
{
$charsToSplit = array_keys($charsReplace);
$pattern = '/('.implode('|', $charsToSplit).')/';
// split whole word into parts by replacing symbols
$parts = preg_split($pattern, $word, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$replaceParts = array();
$placeholder = '';
// create string with placeholders (%s) for sprintf and array of replacing symbols
foreach ($parts as $wordPart) {
if (isset($charsReplace[$wordPart])) {
$replaceParts[] = $wordPart;
$placeholder .= '%s';
} else {
$placeholder .= $wordPart;
}
}
$paramsCnt = count($replaceParts);
$combinations = array();
$combinationsCnt = pow(2, $paramsCnt);
// iterate all combinations (with help of binary codes)
for ($i = 0; $i < $combinationsCnt; $i++) {
$mask = sprintf('%0'.$paramsCnt.'b', $i);
$sprintfParams = array($placeholder);
foreach ($replaceParts as $index => $char) {
$sprintfParams[] = $mask[$index] == 1 ? $charsReplace[$char] : $char;
}
// fill current combination into placeholder and collect it in array
$combinations[] = call_user_func_array('sprintf', $sprintfParams);
}
return $combinations;
}
$lt = array(
'a' => 'ą',
'e' => 'ę',
'i' => 'į',
);
$word = 'stazeki';
$combinations = getCombinations($word, $lt);
print_r($combinations);
// Output:
// Array
// (
// [0] => stazeki
// [1] => stazekį
// [2] => stazęki
// [3] => stazękį
// [4] => stązeki
// [5] => stązekį
// [6] => stązęki
// [7] => stązękį
// )

This is an implementation in PHP :
<?php
/**
* String variant generator
*/
class stringVariantGenerator
{
/**
* Contains assoc of char => array of all its variations
* @var array
*/
protected $_mapping = array();
/**
* Class constructor
*
* @param array $mapping Assoc array of char => array of all its variation
*/
public function __construct(array $mapping = array())
{
$this->_mapping = $mapping;
}
/**
* Generate all variations
*
* @param string $string String to generate variations from
*
* @return array Assoc containing variations
*/
public function generate($string)
{
return array_unique($this->parseString($string));
}
/**
* Parse a string and returns variations
*
* @param string $string String to parse
* @param int $position Current position analyzed in the string
* @param array $result Assoc containing all variations
*
* @return array Assoc containing variations
*/
protected function parseString($string, $position = 0, array &$result = array())
{
if ($position <= strlen($string) - 1)
{
// Square brackets are used for string offsets; the {} offset syntax was removed in PHP 8.0
if (isset($this->_mapping[$string[$position]]))
{
foreach ($this->_mapping[$string[$position]] as $translatedChar)
{
$string[$position] = $translatedChar;
$this->parseString($string, $position + 1, $result);
}
}
else
{
$this->parseString($string, $position + 1, $result);
}
}
else
{
$result[] = $string;
}
return $result;
}
}
// This is where you define what are the possible variations for each char
$mapping = array(
'e' => array('#', '_'),
'p' => array('*'),
);
$word = 'Apple love!';
$generator = new stringVariantGenerator($mapping);
print_r($generator->generate($word));
It would return :
Array
(
[0] => A**l# lov#!
[1] => A**l# lov_!
[2] => A**l_ lov#!
[3] => A**l_ lov_!
)
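In your case, if you want to use the letter itself as a valid translated value, just add it into the array. (Writing to a string offset replaces a single byte, so multibyte replacements such as ą will not survive this class as written; it would need a multibyte-aware variant for the mapping below.)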
$lt = array(
'a' => array('a', 'ą'),
'e' => array('e', 'ę'),
'i' => array('i', 'į'),
);

I'm not sure if you can do this with keys and values, but with two arrays definitely.
$find = array('ą','ę','į');
$replace = array('a', 'e', 'i');
$string = 'tązekį';
echo str_replace($find, $replace, $string);

I'm not sure If I understand your question, but here is my answer :-)
$word = 'tazeki';
$word_arr = array();
$word_arr[] = $word;
//Loop through the $lt-array where $key represents what char to search for
//$letter what to replace with
//
foreach($lt as $key=>$letter) {
//Loop through each char in the $word-string
for( $i = 0; $i <= strlen($word)-1; $i++ ) {
$char = substr( $word, $i, 1 );
//If current letter in word is same as $key from $lt-array
//then add a word the $word_arr where letter is replace with
//$letter from the $lt-array
if ($char === $key) {
$word_arr[] = str_replace($char, $letter, $word);
}
}
}
var_dump($word_arr);

I'm assuming you have a known number of elements in your array, and I am assuming that that number is 3. You will have to have additional loops if you have additional elements in your $lt array.
$lt = array(
'a' => array('a', 'x'),
'e' => array('e', 'x'),
'i' => array('i', 'x')
);
$str = 'tazeki';
foreach ($lt['a'] as $a)
foreach ($lt['e'] as $b)
foreach ($lt['i'] as $c) {
$newstr = str_replace(array_keys($lt), array($a, $b, $c), $str);
echo "$newstr<br />\n";
}
If the number of elements in $lt is unknown or variable then this is not a good solution.

Well, though @Rizier123 and others have already provided good answers with clear explanations, I would like to leave my contribution as well. This time, honoring the Way of the Short Source Code over readability ... ;-)
$lt = array('a' => 'ą', 'e' => 'ę', 'i' => 'į');
$word = 'tazeki';
for ($i = 0; $i < strlen($word); $i++)
$lt[$word[$i]] && $r[pow(2, $u++)] = [$lt[$word[$i]], $i];
for ($i = 1; $i < pow(2, count($r)); $i++) {
for ($w = $word, $u = end(array_keys($r)); $u > 0; $u >>= 1)
($i & $u) && $w = substr_replace($w, $r[$u][0], $r[$u][1], 1);
$res[] = $w;
}
print_r($res);
Output:
Array
(
[0] => tązeki
[1] => tazęki
[2] => tązęki
[3] => tazekį
[4] => tązekį
[5] => tazękį
[6] => tązękį
)
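For comparison, the same bitmask idea can be written more explicitly. This is only a sketch: the function name allVariants is made up, it assumes the mbstring extension and PHP 7.4 or later (for mb_str_split), and unlike the output above it also returns the unchanged word as the first element.

```php
<?php
// Hypothetical helper: every bit of $mask decides whether one replaceable
// character keeps its original form (0) or takes its replacement (1).
function allVariants(string $word, array $map): array
{
    $chars = mb_str_split($word); // split into characters, not bytes

    // Remember the positions of the characters that have a replacement.
    $positions = [];
    foreach ($chars as $i => $ch) {
        if (isset($map[$ch])) {
            $positions[] = $i;
        }
    }

    $variants = [];
    $total = 1 << count($positions); // 2^n combinations
    for ($mask = 0; $mask < $total; $mask++) {
        $variant = $chars;
        foreach ($positions as $bit => $pos) {
            if ($mask & (1 << $bit)) {
                $variant[$pos] = $map[$chars[$pos]];
            }
        }
        $variants[] = implode('', $variant);
    }
    return $variants;
}

$variants = allVariants('tazeki', ['a' => 'ą', 'e' => 'ę', 'i' => 'į']);
print_r($variants);
```

Working on an array of characters avoids the byte-offset problems that string offsets have with characters such as ą.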

PHP find positions of all occurrences of a particular word in a string

This is slightly different to finding all the positions of a substring inside a string because I want it to work with words which may be followed by a space, comma, semi-colon, colon, fullstop, exclamation mark and other punctuation.
I have the following function to find all the positions of a substring:
function strallpos($haystack,$needle,$offset = 0){
$result = array();
for($i = $offset; $i<strlen($haystack); $i++){
$pos = strpos($haystack,$needle,$i);
if($pos !== FALSE){
$offset = $pos;
if($offset >= $i){
$i = $offset;
$result[] = $offset;
}
}
}
return $result;
}
Problem is, if I try to find all positions of the substring "us", it will return positions of the occurrence in "prospectus" or "inclusive" etc..
Is there any way to prevent this? Possibly using regular expressions?
Thanks.
Stefan

You can capture offset with preg_match_all:
$str = "Problem is, if I try to find all positions of the substring us, it will return positions of the occurrence in prospectus or inclusive us us";
preg_match_all('/\bus\b/', $str, $m, PREG_OFFSET_CAPTURE);
print_r($m);
output:
Array
(
[0] => Array
(
[0] => Array
(
[0] => us
[1] => 60
)
[1] => Array
(
[0] => us
[1] => 134
)
[2] => Array
(
[0] => us
[1] => 137
)
)
)
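If only the offsets are needed, the call above could be wrapped as in this sketch. The function name wordPositions is hypothetical, and it assumes the word starts and ends with word characters (otherwise \b would not behave as intended). The offsets returned are byte offsets.

```php
<?php
// Hypothetical wrapper: returns the byte offsets of every whole-word match.
function wordPositions(string $haystack, string $word): array
{
    // preg_quote() escapes any regex metacharacters in the word.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    preg_match_all($pattern, $haystack, $m, PREG_OFFSET_CAPTURE);

    // Each element of $m[0] is [matchedText, offset]; keep only the offsets.
    return array_column($m[0], 1);
}

$str = "Problem is, if I try to find all positions of the substring us, it will return positions of the occurrence in prospectus or inclusive us us";
$positions = wordPositions($str, 'us');
print_r($positions);
```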

Just to demonstrate a non regexp alternative
$string = "It behooves us all to offer the prospectus for our inclusive syllabus";
$filterword = 'us';
$filtered = array_filter(
str_word_count($string,2),
function($word) use($filterword) {
return $word == $filterword;
}
);
var_dump($filtered);
where the keys of $filtered are the offset position
If you want case-insensitive, replace
return $word == $filterword;
with
return strtolower($word) == strtolower($filterword);

How to parse heterogenous markup with PHP?

I have a string with custom markup for saving songs with chords, tabulatures, notes etc. It contains
things in various brackets: \[.+?\], \[\[.+?\]\], \(.+?\)
arrows: <-{3,}>, \-{3,}>, <\-{3,}
and so on...
Sample text might be
Text Text [something]
--->
Text (something 021213)
Now I wish to parse the markup into array of tokens, objects of corresponding classes, which would look like (matched parts in brackets)
ParsedBlock_Text ("Text Text ")
ParsedBlock_Chord ("something")
ParsedBlock_Text (" ")
ParsedBlock_NewColumn
ParsedBlock_Text (" text ")
ParsedBlock_ChordDiagram ("something 021213")
I know how to match them, but either I must match each different pattern, and save offsets to properly sort the array, or I match them at once and I don't know which one has been matched.
Thanks, MK

Assuming you do not try to nest these structures, this will tokenize your text:
function ParseText($text) {
$re = '/\[\[(?P<DoubleBracket>.*?)]]|\[(?P<Bracket>.*?)]|\((?P<Paren>.*?)\)|(?<Arrow><---+>?|---+>)/s';
$keys = array('DoubleBracket', 'Bracket', 'Paren', 'Arrow');
$result = array();
$lastStart = 0;
if (preg_match_all($re, $text, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE)) {
foreach ($matches as $match) {
$start = $match[0][1];
$prefix = substr($text, $lastStart, $start - $lastStart);
$lastStart = $start + strlen($match[0][0]);
if ($prefix != '' && !ctype_space($prefix)) {
$result []= array('Text', trim($prefix));
}
foreach ($keys as $key) {
if (isset($match[$key]) && $match[$key][1] >= 0) {
$result []= array($key, $match[$key][0]);
break;
}
}
}
}
$prefix = substr($text, $lastStart);
if ($prefix != '' && !ctype_space($prefix)) {
$result []= array('Text', trim($prefix));
}
return $result;
}
Example:
$mytext = <<<'EOT'
Text Text [something]
--->
Text (something 021213)
More Text
EOT;
$parsed = ParseText($mytext);
foreach ($parsed as $item) {
print_r($item);
}
Output:
Array
(
[0] => Text
[1] => Text Text
)
Array
(
[0] => Bracket
[1] => something
)
Array
(
[0] => Arrow
[1] => --->
)
Array
(
[0] => Text
[1] => Text
)
Array
(
[0] => Paren
[1] => something 021213
)
Array
(
[0] => Text
[1] => More Text
)
http://ideone.com/kJQrBw
If you want to add more patterns to the regex, make sure you put longer patterns at the start, so they are not mistakenly matched as the wrong type.
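To see why the order of alternatives matters, here is a small sketch that runs the same two bracket alternatives in both orders. The helper name tokenTypes and the sample text are made up for illustration; it also skips empty brackets, which is a simplification.

```php
<?php
// Hypothetical helper: lists "Type:content" for each match of $re in $text.
function tokenTypes(string $re, string $text): array
{
    $types = [];
    preg_match_all($re, $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        foreach (['DoubleBracket', 'Bracket'] as $key) {
            // A group that did not take part in the match is missing or empty.
            if (isset($m[$key]) && $m[$key] !== '') {
                $types[] = $key . ':' . $m[$key];
                break;
            }
        }
    }
    return $types;
}

$text = 'a [[double]] b [single]';

// Longer pattern first: [[...]] is recognized as a double bracket.
$longFirst  = tokenTypes('/\[\[(?P<DoubleBracket>.*?)]]|\[(?P<Bracket>.*?)]/', $text);

// Shorter pattern first: the single-bracket branch grabs "[[double]" instead.
$shortFirst = tokenTypes('/\[(?P<Bracket>.*?)]|\[\[(?P<DoubleBracket>.*?)]]/', $text);

print_r($longFirst);
print_r($shortFirst);
```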

Print list of defined str_word_count matches within exploded array by index

I previously had some help with this from @HSZ, but I have had trouble getting the solution to work with an existing array. What I'm trying to do is explode on quotes, make all words uppercase, get each word's index, echo only the ones I define, and then implode. In simplest terms: always remove the 4th word within quotations, or the 3rd and 4th... This could probably be done with regex as well.
Example:
Given ( [1] => Hello [2] => World [3] => This [4] => Is [5] => a [6] => Test ), only output the indexes I define, such as [1] (Hello) and [2] (World), leaving out [3], [4], [5] and [6] (This Is a Test) so that only Hello World remains, or [1] and [6] for Hello Test...
Such as:
echo $data[1] + ' ' + $data[6]; //Would output index 1 and 6 Hello and Test
Existing Code
if (stripos($data, 'test') !== false) {
$arr = explode('"', $data);
for ($i = 1; $i < count($arr); $i += 2) {
$arr[$i] = strtoupper($arr[$i]);
$arr[$i] = str_word_count($arr, 1); // Doesnt work with array of course.
}
$arr = $matches[1] + ' ' + $matches[6];
$data = implode('"', $arr);
}

Assuming $data is 'Hello world "this is a test"', $arr = explode('"', $data) will be the same as:
$arr = array(
0 => 'Hello world ',
1 => 'this is a test',
2 => '',
);
If you want to do things with this is a test, you can explode it out using something like $testarr = explode(' ', $arr[1]);.
You can then do something like:
$matches = array();
foreach ($testarr as $key => $value) {
$value = strtoupper($value);
if(($key+1)%4 == 0) { // % is modulus; the result is the remainder the division of two numbers. If it's 0, the key+1 (compensate for 0 based keys) is divisible by 4.
$matches[] = $value;
}
}
$matches = implode('"',$matches);
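If the goal is to keep specific word positions rather than every 4th word, something along these lines might work. The helper name keepWords, the 1-based positions and the sample $data are assumptions made for this sketch.

```php
<?php
// Hypothetical helper: upper-cases the quoted text and keeps only the words
// at the given 1-based positions, joined by single spaces.
function keepWords(string $quoted, array $positions): string
{
    $words = str_word_count(strtoupper($quoted), 1); // format 1: list of words
    $kept = [];
    foreach ($positions as $p) {
        if (isset($words[$p - 1])) {
            $kept[] = $words[$p - 1];
        }
    }
    return implode(' ', $kept);
}

$data = 'Say "Hello World This Is a Test" now';
$parts = explode('"', $data);

// After exploding on quotes, the odd indexes hold the quoted sections.
for ($i = 1; $i < count($parts); $i += 2) {
    $parts[$i] = keepWords($parts[$i], [1, 6]);
}
$data = implode('"', $parts);
echo $data, "\n";
```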
