php, preg_match, regex, extract specific text

I have a very big .txt file with our clients' orders and I need to move it into a MySQL database. However, I don't know what kind of regex to use, as the pieces of information do not differ much from one another.
-----------------------
4046904
KKKKKKKKKKK
Laura Meyer
MassMutual Life Insurance
153 Vadnais Street
Chicopee, MA 01020
US
413-744-5452
lmeyer#massmutual.co...
KKKKKKKKKKK
373074210772222 02/12 6213 NA
-----------------------
4046907
KKKKKKKKKKK
Venkat Talladivedula
6105 West 68th Street
Tulsa, OK 74131
US
9184472611
venkat.talladivedula...
KKKKKKKKKKK
373022121440000 06/11 9344 NA
-----------------------
I tried something, but I couldn't even extract the name. Here is a sample of my unsuccessful attempt:
$htmlContent = file_get_contents("orders.txt");
//print_r($htmlContent);
$pattern = "/KKKKKKKKKKK(.*)\n/s";
preg_match_all($pattern, $htmlContent, $matches);
print_r($matches);
$name = $matches[1][0];
echo $name;

You may want to avoid regexes for something like this. Since the data is clearly organized by line, you could repeatedly read lines with fgets() and parse the data that way.

You could read this file with regex, but it may be quite complicated to create a regex that can read all the fields.
I recommend that you read this file line by line and parse each line, detecting which kind of data it contains.
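A minimal sketch of that line-by-line idea, under stated assumptions: each record sits between dashed separator lines, the two rows of K characters bracket the contact block, the name is the first line of that block, and the phone and e-mail are its last two lines. The function name parseOrders and the array keys are made up for illustration; the real file may need different rules.

```php
<?php
// Hypothetical helper: split the dump into records and classify lines by position.
function parseOrders(string $text): array
{
    $orders = [];
    // Records are assumed to be separated by lines made only of dashes.
    foreach (preg_split('/^-{5,}\s*$/m', $text) as $chunk) {
        // Trim every line and drop the blank ones.
        $lines = array_values(array_filter(array_map('trim', preg_split('/\R/', $chunk)), 'strlen'));
        // Find the marker rows (lines consisting only of the letter K).
        $markers = array_keys(array_filter($lines, fn($l) => preg_match('/^K+$/', $l) === 1));
        if (count($markers) !== 2) {
            continue; // not a complete record
        }
        [$start, $end] = $markers;
        $contact = array_slice($lines, $start + 1, $end - $start - 1);
        if ($start < 1 || count($contact) < 2) {
            continue; // no id line or contact block too short
        }
        $orders[] = [
            'id'      => $lines[$start - 1],
            'name'    => $contact[0],
            'phone'   => $contact[count($contact) - 2],
            'email'   => $contact[count($contact) - 1],
            'payment' => $lines[$end + 1] ?? '',
        ];
    }
    return $orders;
}
```

Because the optional company line sits in the middle of the contact block, reading the name from the top and the phone and e-mail from the bottom avoids having to guess whether a company is present.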

As you know exactly where your data is (i.e. which line it's on), why not just get it that way?
Something like:
$htmlContent = file_get_contents("orders.txt");
$arrayofclients = explode("-----------------------", $htmlContent);
$newlinesep = "\r\n"; // use "\n" instead if the file has Unix line endings
for ($i = 0; $i < count($arrayofclients); $i++)
{
    $temp = explode($newlinesep, $arrayofclients[$i]);
    // the indexes below depend on the exact line layout of the file
    $idnum = $temp[0];
    $name = $temp[4];
    $houseandstreet = $temp[6];
    //etc
}
or simply read the file line by line using fgets() - something like:
$i = 0; $j = 0;
$file = fopen("orders.txt", "r");
$clients = [];
while (($line = fgets($file)) !== false)
{
    $line = rtrim($line, "\r\n"); // strip the line ending kept by fgets()
    $i++;
    switch ($i)
    {
        case 2:
            $clients[$j]["idnum"] = $line;
            break;
        case 6:
            $clients[$j]["name"] = $line;
            break;
        //add more cases here for each line up to:
        case 18:
            //there are 18 lines per client if I counted right, so increment $j and reset $i.
            $j++;
            $i = 0;
            break;
    }
}
fclose($file);
You could use regexes, but they are a bit awkward for this situation.
Nico

For the record, here is the regex that will capture the names for you. (Granted, speed may very well be an issue.)
(?<=K{10}\s{2})\K[^\r\n]++(?!\s{2}-)
Explanation:
(?<=K{10}\s{2}) #Positive lookbehind for ten K characters followed by 2 whitespace characters (the return/newline pair)
\K[^\r\n]++ #Possessively match 1 or more non-return/newline characters
(?!\s{2}-) #Negative lookahead for 2 whitespace characters (return/newline) followed by a dash
Here is a Regex Demo.
You will notice that my regex pattern changes slightly between the Regex Demo and my PHP Demo. Slight tweaking depending on environment may be required to match the return / newline characters.
Here is the php implementation (Demo):
if(preg_match_all("/(?<=K{10}\s{2})\K[^\r\n]++(?!\s{2}-)/",$htmlContent,$matches)){
var_export($matches[0]);
}else{
echo "no matches";
}
By using \K in my pattern I avoid actually having to capture with parentheses. This cuts down array size by 50% and is a useful trick for many projects. The \K basically says "start the fullstring match from this point", so the matches go in the first subarray (fullstrings, key=0) of $matches instead of generating a fullstring match in 0 and the capture in 1.
Output:
array (
0 => 'Laura Meyer',
1 => 'Venkat Talladivedula',
)
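To make the \K remark concrete, here is a small comparison on simplified sample data. The sample assumes a marker row directly followed by a name and plain \n line endings, which is not the real file layout. The capturing version fills two sub-arrays, the \K version only one.

```php
<?php
// Simplified, made-up sample: each marker row is immediately followed by a name.
$text = "KKKKKKKKKKK\nLaura Meyer\nKKKKKKKKKKK\nVenkat Talladivedula\n";

// With a capture group: $withGroup[0] holds the full matches, $withGroup[1] the names.
preg_match_all('/^K{11}\n([^\n]+)/m', $text, $withGroup);

// With \K: everything matched before \K is dropped, so $withK[0] already holds the names.
preg_match_all('/^K{11}\n\K[^\n]+/m', $text, $withK);

var_export(count($withGroup)); // 2
echo "\n";
var_export(count($withK));     // 1
echo "\n";
var_export($withK[0]);
```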

Related

Match/extract all characters between 2 strings

I want to extract John Doe from the string \n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n
So I guess the regex pattern needs to extract all characters between 'Volledige naam:' and '\n'. Is there anyone who can help me out?

You may use this regex to capture the name in group 1,
naam:\s+([a-zA-Z ]+)
Since the name can contain only letters and spaces, the character class [a-zA-Z ]+ is used.
PHP sample code:
$str = "\n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n";
preg_match('/naam:\s+([a-zA-Z ]+)/', $str, $matches);
print_r($matches[1]);
Prints:
John Doe

You can use
^Volledige naam:\s*\K.+
in multiline mode. That is
^ # start of line
Volledige naam:\s*\K # Volledige naam:, whitespaces and "forget" what's been matched
.+ # rest of the line
In PHP:
<?php
$string = <<<DATA
*DRIVGo*
Volledige naam: John Doe
Telefoonnummer: 0612345678
IP: 94.214.168.86
DATA;
$regex = '~^Volledige naam:\s*\K.+~m';
if (preg_match($regex, $string, $match)) {
print_r($match);
}
?>
See a demo on ideone.com as well as on regex101.com.
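Both answers above hard-code the label. As a hedged generalisation, the sketch below builds the pattern from two arbitrary delimiter strings with preg_quote(). The helper name textBetween is hypothetical, and it returns only the first occurrence.

```php
<?php
// Hypothetical helper: return the trimmed text between $start and $end,
// or null when the pair of delimiters is not found.
function textBetween(string $subject, string $start, string $end): ?string
{
    // preg_quote() escapes regex metacharacters in the delimiters;
    // the lazy (.*?) stops at the first occurrence of $end.
    $pattern = '~' . preg_quote($start, '~') . '\s*(.*?)\s*' . preg_quote($end, '~') . '~s';
    return preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

$str = "\n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n";
echo textBetween($str, 'Volledige naam:', "\n"); // John Doe
```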

The required string always starts right after the position returned by indexOf(':') and ends at the position returned by a second indexOf call that uses the first result as its offset (provided that neither call reports "not found", which would mean the complete segment is not contained in the string).
A regular expression seems less useful here because the source string does not vary in a way that requires an automaton.
Consider a simple split('\n') operation [optionally limited to the number of parts you need], which can be followed by further such calls if necessary to obtain the desired value without any underlying regex engine.
The logic is the same as what a regex does for you in its underlying implementation, but the associated cost in memory and performance is usually only justified in certain scenarios [for instance those involving code page or locale conversions; another case would be finding words with incorrect declension, punctuation, etc.], which do not seem to apply here.
Consider a parser construct with fields and methods that can obtain [point to] the data and verify its integrity when required; this also lets you quickly serialize and deserialize the results in most cases.
Finally, since you indicated that your language is PHP, note that the equivalent of indexOf is strpos. The following code demonstrates various ways to solve this problem without regex.
$str = "\n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n";
$search = chr(10);
$parts = explode($search, $str);
$partsCount = count($parts);
print_r($parts);
if($partsCount > 1) print($parts[1]); //*DRIVGo*
print('-----Same results via different methodology------');
$groupStart = 0;
$groupEnd = $groupStart;
$max = strlen($str);
//While the groupEnd has not approached the length of str
while($groupEnd <= $max &&
($groupStart = strpos($str, $search, $groupStart)) !== false && // find search in str starting at groupStart, assign result to groupStart (strpos returns false when nothing is found)
($groupEnd = strpos($str, $search, $groupEnd + 1)) > $groupStart) // find search in str starting at groupEnd + 1, assign result to groupEnd
{
//Show the start, end, length and resulting substring
print_r([$groupStart, $groupEnd, $groupEnd - $groupStart, substr($str, $groupStart, $groupEnd - $groupStart)]);
//advance the parsing
$groupStart = $groupEnd;
}
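The code above walks through every segment. For the original question (getting the name only), the same strpos idea can be reduced to two calls. This is a sketch; the helper name extractAfterLabel is made up for illustration.

```php
<?php
// Hypothetical helper: return the rest of the line that follows $label,
// or null when the label does not occur in $str.
function extractAfterLabel(string $str, string $label): ?string
{
    $start = strpos($str, $label);
    if ($start === false) {
        return null; // label not present
    }
    $start += strlen($label);          // skip past the label itself
    $end = strpos($str, "\n", $start); // first newline after the label
    if ($end === false) {
        $end = strlen($str);           // label was on the last line
    }
    return trim(substr($str, $start, $end - $start));
}

$str = "\n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n";
echo extractAfterLabel($str, 'Volledige naam:'); // John Doe
```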

Deleting/excluding empty lines from being saved and used in PHP

I'm currently struggling to keep empty lines from being saved into the array $presidents (which would also solve the problem of the foreach loop echoing strings without names).
I've tried a few things that, in my amateur opinion, should have worked (preg_match(), trim(), etc.).
Can someone help me with this?
$file = fopen("namen2.txt", "r");
$presidents = array();
$count = count(file("namen2.txt"));
for($i = 1; $i <= $count;){
$file_line = fgets($file);
$line = explode(" ", $file_line);
$president = array();
$president["firstname"] = $line[0];
if(empty($president["firstname"] == false)){
$president["lastname"] = $line[1];
$president["counter"] = $i;
$presidents[] = $president;
$i++;
}
else{echo "fehler";}
}
foreach($presidents as $president_order){
echo "Hey ". $president_order["firstname"]. " ". $president_order["lastname"]. ", du bist der ". $president_order["counter"]. ". Präsident der USA in meiner Liste. \n";
}
Edit: I've solved my problems by changing conditions and controlling my input at the insertion phase, thanks for the great tips!

You should deal with spaces and unwanted characters when the data is inserted, not afterwards, so go back to that code and clean the data before the insertion takes place.
PHP's trim should work; there's no reason why it should not work for you. If you have a variable holding a string like " Firstname Lastname ", that function will turn it into "Firstname Lastname".
It seems you are splitting each piece of data on spaces. That's not a good idea, because your data may itself contain spaces (first names can have spaces too, Mary Jane for example).

You can spot and delete empty lines with the following regular expression:
preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $presidents);
Since it's an array you may need to loop through array elements though.
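As an alternative to the regular expression, blank lines can be dropped while the file is read. The sketch below uses file() with FILE_IGNORE_NEW_LINES and then filters out lines that are empty after trim(). The function name readNonEmptyLines is hypothetical, and the file is assumed to be small enough to load at once.

```php
<?php
// Hypothetical helper: return the trimmed, non-empty lines of a text file.
function readNonEmptyLines(string $path): array
{
    $lines = file($path, FILE_IGNORE_NEW_LINES); // one array element per line
    if ($lines === false) {
        return []; // file could not be read
    }
    // trim() each line, then keep only lines that still have characters.
    return array_values(array_filter(array_map('trim', $lines), 'strlen'));
}
```

Filtering after trim() also removes lines that contain only spaces or tabs, which a plain empty-string check would keep.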

I didn't test this code, but it seems like a tighter way to accomplish the task. Just trim() and check for positive string length. I've added an element-limit to explode() and condensed your data storage syntax.
$file=fopen("namen2.txt","r");
$i=0; // initialize counter
while(!feof($file)){
$line=trim(fgets($file)); // get line and trim leading/trailing whitespaces from it
if(strlen($line)){ // if there are any non-whitespace characters, proceed
$names=explode(' ',$line,2); // max of two elements produced, in case: Martin Van Buren
// explode will fail to correctly split: John Quincy Adams
$presidents[]=['counter'=>++$i,'firstname'=>$names[0],'lastname'=>(isset($names[1])?$names[1]:'')]; // store the data for future processing
}
}
fclose($file);
foreach($presidents as $pres){
echo "Hey {$pres["firstname"]} {$pres["lastname"]}, du bist der {$pres["counter"]} Präsident der USA in meiner Liste.\n";
}

Replace (add) words case sensitive from arrays

I am new to PHP and especially to regex.
My goal is to automatically enrich texts with hints for "keywords" that are listed in arrays.
This is how far I have come:
$pattern = array("/\bexplanations\b/i",
"/\btarget\b/i",
"/\bhints\b/i",
"/\bhint\b/i",
);
$replacement = array("explanations <i>(Erklärungen)</i>",
"target <i>Ziel</i>",
"hints <i>Hinsweise</i>",
"hint <i>Hinweis</i>",
);
$string = "Target is to add some explanations (hints) from an array to
this text. I am thankful for every hint.";
echo preg_replace($pattern, $replacement, $string);
returns:
target <i>Ziel</i> is to add some explanations <i>(Erklärungen)</i> (hints <i>Hinsweise</i>) from an array to this text. I am thankful for every hint <i>Hinweis</i>
1) In general, I wonder whether there are more elegant solutions (possibly without replacing the original word).
Later on, the arrays will contain more than 1000 items and will come from MariaDB.
2) How can I get a word such as "Targets" treated case-sensitively
(without doubling the length of my arrays)?
Sorry for my English and many thanks in advance.

If you plan to increase the size of your array and the text may be fairly long, processing the whole text once per word isn't an efficient approach. With a large array, building one giant alternation of all the words isn't practical either.
But if you store all the translations in an associative array and split the text on word boundaries, you can do it in one pass:
// Translation array with all keys lowercase
$trans = [ 'explanations' => 'Erklärungen',
'target' => 'Ziel',
'hints' => 'Hinsweise',
'hint' => 'Hinweis'
];
$parts = preg_split('~\b~', $text);
$partsLength = count($parts);
// All words are in the odd indexes
for ($i=1; $i<$partsLength; $i+=2) {
$lcWord = strtolower($parts[$i]);
if (isset($trans[$lcWord]))
$parts[$i] .= ' <i>(' . $trans[$lcWord] . ')</i>';
}
$result = implode('', $parts);
The limitation here is that you can't use a key that contains a word boundary (for instance, to translate a whole expression of several words). To handle that case, you can use preg_match_all in place of preg_split and build a pattern that tests these special cases first, something like:
preg_match_all('~mushroom pie\b|\w+|\W*~iS', $text, $m);
$parts = &$m[0];
$partsLength = count($parts);
$i = 1 ^ preg_match('~^\w~', $parts[0]);
for (; $i<$partsLength; $i+=2) {
...
(If you have too many exceptions, other strategies are possible.)
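A related one-pass variant, swapped in here as an alternative to the preg_split loop, uses preg_replace_callback with an associative lookup. It is only a sketch: the function name annotate is made up, keys are assumed to be lowercase single ASCII words, and multi-word keys are not handled.

```php
<?php
// Hypothetical helper: append a translation hint after every known word,
// keeping the word exactly as it was written in the text.
function annotate(string $text, array $trans): string
{
    return preg_replace_callback('~\w+~', function (array $m) use ($trans): string {
        $key = strtolower($m[0]); // case-insensitive lookup
        return isset($trans[$key])
            ? $m[0] . ' <i>(' . $trans[$key] . ')</i>'
            : $m[0];              // unknown word: leave untouched
    }, $text);
}

echo annotate('Target and hint, not hints.', ['target' => 'Ziel', 'hint' => 'Hinweis']);
// Target <i>(Ziel)</i> and hint <i>(Hinweis)</i>, not hints.
```

Each word costs one array lookup, so the running time depends on the length of the text and not on the number of keywords.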

Enclose the search words in parentheses in the regex patterns and use backreferences in the replacements.
See this PHP demo:
$pattern = array("/\b(explanations)\b/i", "/\b(target)\b/i", "/\b(hints)\b/i", "/\b(hint)\b/i", );
$replacement = array('$1 <i>(Erklärungen)</i>', '$1 <i>Ziel</i>', '$1 <i>Hinsweise</i>', '$1 <i>Hinweis</i>', );
$string = "Target is to add some explanations (hints) from an array to this text. I am thankful for every hint.";
echo preg_replace($pattern, $replacement, $string);
That way, the replacement reuses the word as it was found, with the actual case used in the text.
Note that it is very important to make sure the patterns are listed in descending order, with longer patterns coming before shorter ones (first Targets, then Target, etc.).

How much percent the string match the regex

Basically, I'm just wondering whether a function like this exists:
$string = 'helloWorld';
// 1 uppercase, 1 lower case, 1 number and at least 8 of length
$regex = '/^\S*(?=\S{8,})(?=\S*[a-z])(?=\S*[A-Z])(?=\S*[\d])\S*$/';
$percent = matchPercent($string, $regex);
echo "the string match {$percent}% of the given regex";
Then, the result could be something like this:
the string match 75% of the given regex
From another post and question, I see that I can do something like this:
$uppercase = preg_match('#[A-Z]#', $password);
$lowercase = preg_match('#[a-z]#', $password);
$number = preg_match('#[0-9]#', $password);
But the goal is for the function to work with any regex pattern.

If you want to do it the regex way and based on the use-case you've provided, we need to make the whole regex optional. Also we'll be using capturing groups in our lookaheads.
But first things first, let's improve your regex:
[\d] is redundant, just use \d.
\S*(?=\S{8,}) remove \S* part, we already have it at the end.
Our regex will look like ^(?=\S{8,})(?=\S*[a-z])(?=\S*[A-Z])(?=\S*\d)\S*$
Now is the tricky part, we will add groups in our lookaheads and make them optional:
^(?=(\S{8,})?)(?=(\S*[a-z])?)(?=(\S*[A-Z])?)(?=(\S*\d)?)\S*$
You might ask why? The groups are made so that we can track them later on. We make them optional so that our regex will always match. That way, we can do some math!
$regex = '~^(?=(\S{8,})?)(?=(\S*[a-z])?)(?=(\S*[A-Z])?)(?=(\S*\d)?)\S*$~';
$input = 'helloWorld';
preg_match_all($regex, $input, $m);
array_shift($m); // Get rid of group 0
for($i = 0, $j = $k = count($m); $i < $j; $i++){ // Looping
if(empty($m[$i][0])){ // If there was no match for that particular group
$k--;
}
}
$percentage = round(($k / $j) * 100);
echo $percentage;

EDIT: I see that Hamza had pretty much the same idea.
Sure! That's a really fun question.
Here is a solution for a simplified validation regex.
$str = 'helloword';
$regex = '~^(?=(\S{8,}))?(?=(\S*[a-z]))?(?=(\S*[A-Z]))?(?=(\S*[\d]))?.*$~';
if(preg_match($regex,$str,$m)) {
$totaltests = 4;
$passedtests = count(array_filter($m)) -1 ;
echo $passedtests / $totaltests;
}
Output: 0.5
How does it work?
For each condition (expressed by a lookahead), we capture the text that can be matched.
We define $totaltests as the total number of tests
We count the number of tests passed with count(array_filter($m)) -1 which removes the empty groups and Group 0, i.e. the overall match.
We divide.
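Both answers rely on rewriting the single regex by hand. A simpler sketch, assuming the caller is able to pass each condition as its own pattern, counts how many of them match. Here matchPercent takes an array of patterns and not one combined regex, which is a deliberate change from the signature in the question.

```php
<?php
// Hypothetical variant of matchPercent(): $rules is a list of independent patterns.
function matchPercent(string $input, array $rules): float
{
    if ($rules === []) {
        return 100.0; // nothing to satisfy
    }
    $passed = 0;
    foreach ($rules as $rule) {
        if (preg_match($rule, $input) === 1) {
            $passed++;
        }
    }
    return round($passed / count($rules) * 100, 1);
}

$rules = ['/^\S{8,}$/', '/[a-z]/', '/[A-Z]/', '/\d/'];
echo matchPercent('helloWorld', $rules); // 75
```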

Repeat pattern using preg_match

I want to validate the strings below: data between backticks may repeat an unlimited number of times as long as each item is followed by a comma; if it is not followed by a comma, it must be followed by a ")". Whitespace is allowed only outside the backticks, not inside them.
I am not experienced with regex, so I don't know how to allow a repeated pattern. Below is my pattern so far.
Thanks
UPDATED
// first 3 lines should match
$lines[] = "(`a-z0-9_-`,`a-z0-9_-`,`a-z0-9_-`,`a-z0-9_-`)";
$lines[] = "( `a-z0-9_-`, `a-z0-9_-` ,`a-z0-9_-` , `a-z0-9_-` )";
$lines[] = "(`a-z0-9_-`,
`a-z0-9_-`
,`a-z0-9_-` ,`a-z0-9_-`)";
// these lines below should not match
$lines[] = "(`a-z0-9_-``a-z0-9_-`,`a-z0-9_-`,`a-z0-9_-`)";
$lines[] = "(`a-z0-9_-``a-z0-9_-`,`a-z0-9_-`.`a-z0-9_-`";
$pattern = '/~^\(\s*(?:[a-z0-9_-]+\s*,?\s*)+\)$~/';
$result = array();
foreach($lines as $key => $line)
{
if (preg_match($pattern, $line))
{
$result[$key] = 'Found match.';
}
else
{
$result[$key] = 'Not found a match.';
}
}
print("<pre>" . print_r($result, true). "</pre>");

You're very close. It looks like you want this:
$pattern = "~^\(\s*`[a-z0-9_-]+`\s*(?:,\s*`[a-z0-9_-]+`\s*)*\)$~";
The two problems with your regex were:
You had two sets of delimiters (slashes and tildes) - pick just one and stick with it. My personal preference is parentheses because then you don't have to escape anything "just because delimiters", but also it helps me remember that the entire match is the first entry in the match array.
By making the comma optional, you were allowing things you didn't want. The solution does involve repeating yourself a little, but it is more accurate.
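A quick way to check the corrected pattern is to run it against shortened stand-ins for the five test lines from the question. The identifiers between the backticks below are made up; only the pattern is taken from the answer.

```php
<?php
$pattern = '~^\(\s*`[a-z0-9_-]+`\s*(?:,\s*`[a-z0-9_-]+`\s*)*\)$~';

$lines = [
    "(`abc`,`d_e`,`f-1`)",         // should match
    "( `abc`, `d_e` ,`f-1` )",     // should match: whitespace outside the backticks
    "(`abc`,\n `d_e`\n ,`f-1`)",   // should match: line breaks count as whitespace
    "(`abc``d_e`,`f-1`)",          // should not match: comma missing between items
    "(`abc`,`d_e`.`f-1`",          // should not match: dot separator, no closing ")"
];

// true where the line is accepted by the pattern, false otherwise
$results = array_map(fn(string $line): bool => preg_match($pattern, $line) === 1, $lines);
var_export($results);
```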

Well you weren't very clear about the matching rules for the data between the brackets, and you didn't really specify if you wanted to capture anything so...I took a best guess based on context of your code, hopefully this will suit your needs.
edit: fixed the code block so it shows the backticks in the pattern; also changed the delimiter from ~ to / since the OP was confused about that
$pattern = '/^\((\s*`[a-z0-9_-]+`\s*[,)])+$/';

here is a generic repeat pattern:
preg_match_all("/start_string([^repeat_string].*?)end_string/si", $input, $output);
var_dump($output);
