PHP regular expression to parse data

I have a field that contains 20 characters (the string is right-padded with spaces), like below:
VINEYARD HAVEN MA
BOLIVAR TN
,
BOLIVAR, TN
NORTH TONAWANDA, NY
How can I use a regular expression to parse these values? The result I want looks like this:
[1] VINEYARD HAVEN [2] MA
[1] BOLIVAR [2] TN
[1] , or empty [2] , or empty
[1] BOLIVAR, or BOLIVAR [2] TN or ,TN
[1] NORTH TONAWANDA, or NORTH TONAWANDA [2] NY or ,NY
Currently I use this regex:
^(\D*)(?=[ ]\w{2}[ ]*)([ ]\w{2}[ ]*)
But it could not match the line:
,
Please help me adjust my regex so that it matches all of the data above.

What about this regex: ^(.*)[ ,](\w*)$ ? You can see it working here: http://regexr.com/3cno7.
Example usage:
<?php
$string = 'VINEYARD HAVEN MA
BOLIVAR TN
,
BOLIVAR, TN
NORTH TONAWANDA, NY';
// trim() also strips the right-hand padding and any stray "\r"
$lines = array_map('trim', explode("\n", $string));
$pattern = '/^(.*)[ ,](\w*)$/';
foreach ($lines as $line) {
    // Skip lines that contain neither a space nor a comma
    if (preg_match($pattern, $line, $matched)) {
        print 'first: "' . $matched[1] . '", second: "' . $matched[2] . '"' . PHP_EOL;
    }
}

It's probably possible to implement this in a single regular expression (try /(.*)\b([A-Z][A-Z])$/ ), but if you don't know how to write the regular expression you'll never be able to debug it. Yes, it's worth working out as a learning exercise, but since we're talking about PHP here (which does have a mechanism for storing compiled REs and isn't often used for bulk data operations), I would use something like the following if I needed to solve the problem quickly and in maintainable code:
$str = trim($str);
// The /i flag makes [A-Z] match lower-case letters as well
if (preg_match("/\b[A-Z][A-Z]$/i", $str, $match)) {
    $state = $match[0];
    // Drop the two state letters, then strip leftover spaces and commas
    $town = trim(substr($str, 0, -2), " ,\t\n\r\0\x0B");
}
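The trim-then-match idea can be wrapped in a small function so that it is easy to check against every sample row. This is only a sketch: the function name splitTownState and the choice to return two empty strings for rows such as "," are assumptions made for illustration, not something taken from the answers above.

```php
<?php
// Hypothetical helper (name and return shape are assumptions for illustration).
// Splits a space-padded "TOWN ST" or "TOWN, ST" field into [town, state].
function splitTownState(string $field): array
{
    $field = trim($field);
    // Lazy town part, then one or more separators, then exactly two letters at the end.
    if (preg_match('/^(.*?)[\s,]+([A-Za-z]{2})$/', $field, $m)) {
        return [$m[1], strtoupper($m[2])];
    }
    // Rows such as "," carry no usable data.
    return ['', ''];
}

foreach (['VINEYARD HAVEN MA   ', 'BOLIVAR, TN         ', ',                   '] as $row) {
    [$town, $state] = splitTownState($row);
    echo '[1] ' . $town . ' [2] ' . $state . PHP_EOL;
}
```

Because the separators are consumed outside the capture groups, the town never ends with a stray comma, which is one of the two output variants the question allows.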

Related

Parsing PDF tables into csv with php

I need to convert a PDF file with tables into CSV, so I used PdfParser to extract the entire text, and then used preg_match_all to search for the patterns of each table so that I can create an array from each table of the PDF.
The structure of the following PDF is:
When I parse I get this
ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas
I figured out how to match all the ECO-XXXXX codes with preg_match_all, but I don't know how to match all the descriptions.
This is what is working for ECO-XXXXXX:
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('publication.pdf');
$text = $pdf->getText();
echo $text;
$pattern = '/ECO-[.-^*-]{3,}| ECO-[.-^*-]{4,}\s\b[NMB]\b|ECO-[.-^*-]{4,}\sUP| ECO-[.-^*-]{3,}\sUP\s[B-N-M]{1}| ECO-[.-^*-]{3,}\sRX/' ;
preg_match_all($pattern, $text, $array);
echo "<hr>";
print_r($array);
I get
Array ( [0] => Array ( [0] => ECO-698 [1] => ECO-CHI-522 [2]
You may try this regex:
(ECO[^\s]+)\s+(.*?)(?=ECO|\z)
As per the input string, group 1 contains the ECO block and group 2 contains the description.
Explanation:
(ECO[^\s]+) captures the full ECO block until it reaches whitespace.
\s+ matches one or more whitespace characters.
(.*?)(?=ECO|\z) Here (.*?) matches the description lazily, and (?=ECO|\z) is a positive lookahead that stops the match at the next ECO or at the end of the string (\z).
Source code:
$re = '/(ECO[^\s]+)\s+(.*?)(?=ECO|\z)/m';
$str = 'ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
$val = 1;
foreach ($matches as $value) {
    echo "\n\nRow no:" . $val++;
    echo "\ncol 1:" . $value[1] . "\ncol 2:" . $value[2];
}
Update, as per the comment:
((?:ECO-(?!DE)[^\s]+)(?: (?:RX|B|N|M|UP|UP B|UP N|UP M))?)\s+(.*?)(?=(?:ECO-(?!DE))|\z)
Regex 101 updated
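Since the stated goal is CSV, the matched pairs could then be written out with fputcsv. The following is a minimal sketch under some assumptions: the helper name extractRows is hypothetical, the two-column layout (code, description) is a guess at the desired CSV, and the sample text is the one from the question rather than real PdfParser output.

```php
<?php
// Sample text taken from the question; the CSV layout (code, description) is an assumption.
$text = 'ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas';

// Hypothetical helper: returns one [code, description] row per ECO block.
function extractRows(string $text): array
{
    $re = '/(ECO[^\s]+)\s+(.*?)(?=ECO|\z)/';
    preg_match_all($re, $text, $matches, PREG_SET_ORDER);
    $rows = [];
    foreach ($matches as $m) {
        // The lazy group keeps the space before the next ECO block, so trim it.
        $rows[] = [$m[1], trim($m[2])];
    }
    return $rows;
}

$out = fopen('php://output', 'w');
foreach (extractRows($text) as $row) {
    // Separator, enclosure and escape character are passed explicitly.
    fputcsv($out, $row, ',', '"', '\\');
}
fclose($out);
```

Note that without the s modifier the dot does not match line breaks, so if the text extracted from the PDF contains newlines inside a description, the pattern would probably need adjusting.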

Splitting a string by comma while ignoring other commas (not quoted)

I am working with an API and am having trouble choosing a delimiter to split the results into an array. Below is sample data I am receiving from the API:
name: jo mamma, location: Atlanta, Georgia, description: He is a good
boy, and he is pretty funny, skills: not much at all!
I would like to be able to split like so:
name: jo mamma
location: Atlanta, Georgia
description: He is a good boy, and he is pretty funny
skills: not much at all!
I have tried using the explode function and the regex preg_split.
$exploded = explode(",", $data);
$regex = preg_split("/,\s/", $data);
But I am not getting the intended results, because it also splits on the commas inside the values, such as the one in "Atlanta, Georgia" and the one after "boy". Results below:
name: jo mamma
location: Atlanta
Georgia
description: He is a good boy
and he is pretty funny
skills: not much at all!
Any help would be greatly appreciated. Thanks.
Just do a simple split using a zero-width positive lookahead. It splits only on a comma that is followed by text like name: (a word and then a colon).
$regex = preg_split("/,\s*(?=\w+\:)/", $data);
/*
Array
(
[0] => "name: jo mamma"
[1] => "location: Atlanta, Georgia"
[2] => "description: He is a good boy, and he is pretty funny"
[3] => "skills: not much at all!"
)
*/
Learn more about lookahead and lookbehind here: http://www.regular-expressions.info/lookaround.html
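If the pieces are needed as key/value pairs rather than as whole strings, each piece from the lookahead split can be cut once more at its first colon. A minimal sketch, assuming the hypothetical function name parseKeyValueLine and assuming that no value itself contains a comma followed by a word and a colon:

```php
<?php
// Hypothetical helper: turns "key: value, key: value" into an associative array.
function parseKeyValueLine(string $data): array
{
    $fields = [];
    // Split only on commas that are followed by "word:".
    foreach (preg_split('/,\s*(?=\w+:)/', $data) as $pair) {
        // Limit of 2 keeps any later colons inside the value.
        [$key, $value] = explode(':', $pair, 2);
        $fields[trim($key)] = trim($value);
    }
    return $fields;
}

$data = 'name: jo mamma, location: Atlanta, Georgia, description: He is a good boy, and he is pretty funny, skills: not much at all!';
print_r(parseKeyValueLine($data));
```

The limit argument of explode matters here: a value such as a time or a URL may contain its own colon, and only the first colon should separate key from value.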
Use this regex:
name:\s(.*),\slocation:\s(.*),\sdescription:\s(.*),\sskills:\s(.*)
$text = 'name: jo mamma, location: Atlanta, Georgia, description: He is a good boy, and he is pretty funny, skills: not much at all!';
preg_match_all('/name:\s(.*),\slocation:\s(.*),\sdescription:\s(.*),\sskills:\s(.*)/', $text, $text_matches);
for ($index = 1; $index < count($text_matches); $index++) {
    echo $text_matches[$index][0] . '<br />';
}
Output:
jo mamma
Atlanta, Georgia
He is a good boy, and he is pretty funny
not much at all!

Parsing a nested sentence in PHP

I am very new to PHP and am trying to parse a line from a database and extract some necessary information from it.
EDIT:
I have to extract the authors' names and surnames. For the example line below, the expected output should be:
Ayse Serap Karadag
Serap Gunes Bilgili
Omer Calka
Sevda Onder
Evren Burakgazi-Dalkilic
LINE
[Karadag, Ayse Serap; Bilgili, Serap Gunes; Calka, Omer; Onder, Sevda] Yuzuncu Yil Univ, Sch Med, Dept Dermatol. %#[Burakgazi-Dalkilic, Evren] UMDNJ Cooper Univ Med Ctr, Piscataway, NJ USA.1
I read this line from the database. It contains some author names which I have to extract.
The author names are written inside []. The surname comes first, separated from the given names by a comma; if there are further authors, they are separated by semicolons.
I have to do this in a loop because I have nearly 1000 lines like this.
My code is :
<?php
$con = mysqli_connect("localhost", "root", "", "authors");
if (mysqli_connect_errno()) {
    echo "Failed to connect to MySQL: " . mysqli_connect_error();
}
$result = mysqli_query($con, "SELECT Correspounding_Author FROM paper Limit 10 ");
while ($row = mysqli_fetch_array($result)) {
    echo "<br>";
    echo $row['Correspounding_Author'];
    echo "<br>";
    // do sth here
}
mysqli_close($con);
?>
I have been looking at functions like explode() and substr(), but as I mentioned at the beginning, I cannot work out how to handle this nested structure.
Any help is appreciated.
The code inside your while loop should be:
preg_match_all("/\\[([^\\]]+)\\]/", $row['Correspounding_Author'], $matches);
foreach ($matches[1] as $match) {
    $exp = explode(";", $match);
    foreach ($exp as $val) {
        print(implode(" ", array_map("trim", array_reverse(explode(",", $val)))) . "<br/>");
    }
}
The following should work:
$pattern = '~(?<=\[|\G;)([^,]+),([^;\]]+)~';
if (preg_match_all($pattern, $row['Correspounding_Author'], $matches, PREG_SET_ORDER)) {
    print_r(array_map(function($match) {
        return sprintf('%s %s', ltrim($match[2]), ltrim($match[1]));
    }, $matches));
}
It's a single expression that matches items that:
Start with opening square bracket [ or continue where the last match ended followed by a semicolon,
End just before either a semicolon or closing square bracket.
See also: PCRE Assertions.
Output
Array
(
[0] => Ayse Serap Karadag
[1] => Serap Gunes Bilgili
[2] => Omer Calka
[3] => Sevda Onder
[4] => Evren Burakgazi-Dalkilic
)
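For use inside the while loop over roughly 1000 rows, either approach can be packaged as a function that returns the names instead of printing them. The sketch below uses the explode-based variant; the function name extractAuthors is hypothetical, and it assumes every author entry has the form "Surname, Given names".

```php
<?php
// Hypothetical helper: returns "Given Surname" for every author found inside [...] groups.
function extractAuthors(string $line): array
{
    $authors = [];
    // Capture the contents of each bracketed group.
    preg_match_all('/\[([^\]]+)\]/', $line, $groups);
    foreach ($groups[1] as $group) {
        foreach (explode(';', $group) as $entry) {
            // "Surname, Given names" becomes ["Surname", "Given names"].
            $parts = array_map('trim', explode(',', $entry, 2));
            $authors[] = implode(' ', array_reverse($parts));
        }
    }
    return $authors;
}

$line = '[Karadag, Ayse Serap; Bilgili, Serap Gunes; Calka, Omer; Onder, Sevda] Yuzuncu Yil Univ, Sch Med, Dept Dermatol. %#[Burakgazi-Dalkilic, Evren] UMDNJ Cooper Univ Med Ctr, Piscataway, NJ USA.1';
print_r(extractAuthors($line));
```

Inside the loop from the question this would be called as extractAuthors($row['Correspounding_Author']).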

PHP line parsed into separate objects

I have a line of code in my wordpress widget that outputs from an RSS feed:
<?php echo $entry->title ?>
and when displayed it looks like:
$220,000 :: 504 Freemason St, Unit 2B, Norfolk VA, 23510
or
$274,900 :: 1268 Bells Road, Virginia Beach VA, 23454
What is the easiest way to break this up into different objects?
For example, I'd like to have the price, street name, and city state zip in different objects. The problem is that some of the addresses have unit numbers and it's complicating things. Below is an example of how I would like it to work:
<?php echo $entry->price ?>
<?php echo $entry->street ?>
<?php echo $entry->citystatezip ?>
$220,000
504 Freemason St, Unit 2B
Norfolk VA, 23510
or
$274,900
1268 Bells Road
Virginia Beach VA, 23454
Here is a very crude regex that seems able to parse your string. I'm not the best with regexes, but it seems to work.
/^(\$(?:\d{1,3},?)*) :: (\d* [\w\s,\d]*), ([\w\s]* \w{2}, \d{5})$/
Use this with preg_match; the 1st group is the price, the 2nd is the address, and 3rd is the city/state/zip.
Example:
<?php
$ptn = '/^(\$(?:\d{1,3},?)*) :: (\d* [\w\s,\d]*), ([\w\s]* \w{2}, \d{5})$/';
if (preg_match($ptn, $entry->title, $match) === 1) {
    $price = $match[1];
    $street = $match[2];
    $citystatezip = $match[3];
}
What you need is a regular expression; see http://php.net/manual/en/function.preg-match.php
Alternatively, use explode(), i.e. array explode ( string $delimiter , string $string [, int $limit ] ), which gives you an array of strings if you pick the correct delimiter.
The code below will fill your $entry object as required:
$string = '$274,900 :: 1268 Bells Road, Virginia Beach VA, 23454';
$pricePart = explode('::', $string);
$addressPart = explode(',', $pricePart[1]);
$entry = new stdClass();
$entry->price = trim($pricePart[0]);
if ( count($addressPart) == 3 ) {
    $entry->street = trim($addressPart[0]);
    $entry->citystatezip = trim($addressPart[1]) . ', ' . trim($addressPart[2]);
} else {
    $entry->street = trim($addressPart[0]) . ', ' . trim($addressPart[1]);
    $entry->citystatezip = trim($addressPart[2]) . ', ' . trim($addressPart[3]);
}
Updated answer to handle the unit part.
Update: changed the array names; I hate $array... names, even if it's just a mockup.
(Note: this code isn't the prettiest, but it's meant to give a base to work on. It should be cleaned up and improved a bit.)
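One possible cleanup of the explode approach is to take the city/state and zip from the end of the comma-separated list, so that the street may contain any number of commas (unit numbers, suite numbers and so on). This is a sketch, not the answer's own code: the function name parseListingTitle is hypothetical, and it assumes the title always ends with "City ST, ZIP".

```php
<?php
// Hypothetical helper: splits "PRICE :: STREET[, UNIT], CITY ST, ZIP" into an object.
function parseListingTitle(string $title): ?stdClass
{
    // Split on the first "::" only, so the comma in the price is left alone.
    $halves = explode('::', $title, 2);
    if (count($halves) !== 2) {
        return null;
    }
    $parts = array_map('trim', explode(',', $halves[1]));
    if (count($parts) < 3) {
        return null;
    }
    // The last two pieces are always "City ST" and the zip code.
    $zip = array_pop($parts);
    $cityState = array_pop($parts);

    $entry = new stdClass();
    $entry->price = trim($halves[0]);
    $entry->street = implode(', ', $parts); // whatever is left is the street
    $entry->citystatezip = $cityState . ', ' . $zip;
    return $entry;
}

$entry = parseListingTitle('$220,000 :: 504 Freemason St, Unit 2B, Norfolk VA, 23510');
echo $entry->price . PHP_EOL . $entry->street . PHP_EOL . $entry->citystatezip . PHP_EOL;
```

Working from the end of the list avoids the separate branches for three and four parts.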

How can I match a string between two other known strings and nothing else with REGEX?

I want to extract a string between two other strings. The strings happen to be within HTML tags, but I would like to avoid a conversation about whether I should be parsing HTML with regex (I know I shouldn't, and I have solved the problem with stristr(), but I would like to know how to do it with regular expressions).
A string might look like this:
...uld select “Apply” below.<br/><br/><b>Primary Location</b>: United States-Washington-Seattle<br/><b>Travel</b>: Yes, 75 % of the Time <br/><b>Job Type</b>: Standard<br/><b>Region</b>: US Service Lines: ASL - Business Intelligence<br/><b>Job</b>: Business Intelligence<br/><b>Capability Group</b>: Con/Sol - BI&C<br/><br/>LOC:USA
I am interested in <b>Primary Location</b>: United States-Washington-Seattle<br/> and want to extract 'United States-Washington-Seattle'
I tried '(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)' which worked in RegExr but not PHP:
preg_match("/(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)/", $description,$matches);
You used / as the regex delimiter, so you need to escape it if you want to match it literally, or use a different delimiter. Change
preg_match("/(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)/", $description,$matches);
to
preg_match("/(?<=<b>Primary Location<\/b>:)(.*?)(?=<br\/>)/", $description,$matches);
or this
preg_match("~(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)~", $description,$matches);
Update
I just tested it on www.writecodeonline.com/php and
$description = "uld select “Apply” below.<br/><br/><b>Primary Location</b>: United States-Washington-Seattle<br/><b>Travel</b>: Yes, 75 % of the Time <br/><b>Job Type</b>: Standard<br/><b>Region</b>: US Service Lines: ASL - Business Intelligence<br/><b>Job</b>: Business Intelligence<br/><b>Capability Group</b>: Con/Sol - BI&C<br/><br/>LOC:USA";
preg_match("~(?<=<b>Primary Location</b>:)(.*?)(?=<br/>)~", $description, $matches);
print_r($matches);
is working. Output:
Array ( [0] => United States-Washington-Seattle [1] => United States-Washington-Seattle )
You can also get rid of the capturing group and do
$description = "uld select “Apply” below.<br/><br/><b>Primary Location</b>: United States-Washington-Seattle<br/><b>Travel</b>: Yes, 75 % of the Time <br/><b>Job Type</b>: Standard<br/><b>Region</b>: US Service Lines: ASL - Business Intelligence<br/><b>Job</b>: Business Intelligence<br/><b>Capability Group</b>: Con/Sol - BI&C<br/><br/>LOC:USA";
preg_match("~(?<=<b>Primary Location</b>:).*?(?=<br/>)~", $description, $matches);
print($matches[0]);
Output
United States-Washington-Seattle
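When the two surrounding strings are not fixed in advance, preg_quote can escape them, including the delimiter, so the escaping problem described above does not come back. A minimal sketch, assuming a hypothetical helper named extractBetween that uses a capture group instead of lookarounds and trims the result:

```php
<?php
// Hypothetical helper: returns the trimmed text between $start and $end, or null if not found.
function extractBetween(string $haystack, string $start, string $end): ?string
{
    // The second argument of preg_quote also escapes the chosen delimiter (~).
    $pattern = '~' . preg_quote($start, '~') . '(.*?)' . preg_quote($end, '~') . '~s';
    if (preg_match($pattern, $haystack, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

$description = 'below.<br/><br/><b>Primary Location</b>: United States-Washington-Seattle<br/><b>Travel</b>: Yes, 75 % of the Time <br/>';
echo extractBetween($description, '<b>Primary Location</b>:', '<br/>');
```

The s modifier is a design choice here: it lets the captured text span line breaks, which may or may not be wanted for a given input.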
