Parsing a nested sentence in PHP

I am very new to PHP and am trying to parse a line from a database and extract some necessary information from it.
EDIT:
I have to extract the authors' first names and surnames. For the example line below, the expected output is:
Ayse Serap Karadag
Serap Gunes Bilgili
Omer Calka
Sevda Onder
Evren Burakgazi-Dalkilic
LINE
[Karadag, Ayse Serap; Bilgili, Serap Gunes; Calka, Omer; Onder, Sevda] Yuzuncu Yil Univ, Sch Med, Dept Dermatol. %#[Burakgazi-Dalkilic, Evren] UMDNJ Cooper Univ Med Ctr, Piscataway, NJ USA.1
I read this line from the database. It contains author names that I need to extract.
The author names are written inside []. Each surname comes first, separated from the first names by a comma; multiple authors are separated by semicolons.
I have to do this in a loop because I have nearly 1000 lines like this.
My code is:
<?php
$con=mysqli_connect("localhost","root","","authors");
if (mysqli_connect_errno())
{
echo "Failed to connect to MySQL: " . mysqli_connect_error();
}
$result = mysqli_query($con,"SELECT Correspounding_Author FROM paper Limit 10 ");
while($row = mysqli_fetch_array($result))
{
echo "<br>";
echo $row['Correspounding_Author'] ;
echo "<br>";
// do sth here
}
mysqli_close($con);
?>
I have looked at functions such as explode() and substr(), but as I mentioned at the beginning, I cannot work out how to handle this nested sentence.
Any help is appreciated.

The code inside your while loop should be:
preg_match_all("/\\[([^\\]]+)\\]/", $row['Correspounding_Author'], $matches);
foreach($matches[1] as $match){
$exp = explode(";", $match);
foreach($exp as $val){
print(implode(" ", array_map("trim", array_reverse(explode(",", $val))))."<br/>");
}
}
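The same steps can be wrapped in a function that returns the names instead of printing them, which may be easier to reuse across many rows. This is only a sketch under the assumption that every author is written as "Surname, Forenames"; extractAuthorNames is a hypothetical helper name, not part of the original code.

```php
<?php
// Hypothetical helper: returns "Forenames Surname" strings found in one line.
function extractAuthorNames(string $line): array
{
    $names = [];
    // Capture the text inside every [...] group.
    preg_match_all('/\[([^\]]+)\]/', $line, $groups);
    foreach ($groups[1] as $group) {
        // Authors are separated by ";", surname and forenames by ",".
        foreach (explode(';', $group) as $author) {
            $parts = array_map('trim', explode(',', $author, 2));
            if (count($parts) === 2) {
                $names[] = $parts[1] . ' ' . $parts[0];
            } elseif ($parts[0] !== '') {
                // Assumption: an entry without a comma is kept unchanged.
                $names[] = $parts[0];
            }
        }
    }
    return $names;
}
```

Inside the while loop you could then call extractAuthorNames($row['Correspounding_Author']) and iterate over the result.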

The following should work:
$pattern = '~(?<=\[|\G;)([^,]+),([^;\]]+)~';
if (preg_match_all($pattern, $row['Correspounding_Author'], $matches, PREG_SET_ORDER)) {
print_r(array_map(function($match) {
return sprintf('%s %s', ltrim($match[2]), ltrim($match[1]));
}, $matches));
}
It's a single expression that matches items that:
Start with opening square bracket [ or continue where the last match ended followed by a semicolon,
End just before either a semicolon or closing square bracket.
See also: PCRE Assertions.
Output
Array
(
[0] => Ayse Serap Karadag
[1] => Serap Gunes Bilgili
[2] => Omer Calka
[3] => Sevda Onder
[4] => Evren Burakgazi-Dalkilic
)

Related

Parsing PDF tables into csv with php

I need to convert a PDF file containing tables into CSV, so I used PdfParser to extract the entire text and then preg_match_all to search for the patterns of each table, so that I can build an array from each table in the PDF.
The structure of the following PDF is:
When I parse I get this
ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas
I figured out how to match all the ECO-XXXXX codes with preg_match_all, but I don't know how to match the descriptions.
This is what works for ECO-XXXXXX:
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('publication.pdf');
$text = $pdf->getText();
echo $text;
$pattern = '/ECO-[.-^*-]{3,}| ECO-[.-^*-]{4,}\s\b[NMB]\b|ECO-[.-^*-]{4,}\sUP| ECO-[.-^*-]{3,}\sUP\s[B-N-M]{1}| ECO-[.-^*-]{3,}\sRX/' ;
preg_match_all($pattern, $text, $array);
echo "<hr>";
print_r($array);
I get
Array ( [0] => Array ( [0] => ECO-698 [1] => ECO-CHI-522 [2]
You may try this regex:
(ECO[^\s]+)\s+(.*?)(?=ECO|\z)
As per the input string, group1 contains the ECO Block and group 2 contains the descriptions.
Explanation:
(ECO[^\s]+) captures the full ECO block until it reaches white space.
\s+ matches one or more white-space characters.
(.*?)(?=ECO|\z) Here (.*?) matches the description and (?=ECO|\z) is a positive lookahead for ECO or the end of the string (\z).
Regex101
Source Code (Run here):
$re = '/(ECO[^\s]+)\s+(.*?)(?=ECO|\z)/m';
$str = 'ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
$val=1;
foreach ($matches as $value)
{
echo "\n\nRow no:".$val++;
echo "\ncol 1:".$value[1]."\ncol 2:".$value[2];
}
UPDATE as per the comment:
((?:ECO-(?!DE)[^\s]+)(?: (?:RX|B|N|M|UP|UP B|UP N|UP M))?)\s+(.*?)(?=(?:ECO-(?!DE))|\z)
Regex 101 updated
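A minimal sketch of how the updated pattern might be used to build rows ready for CSV output. The function name extractEcoRows is hypothetical, and the /s modifier is an assumption added here so that a description may span a line break in the extracted text.

```php
<?php
// Hypothetical helper: splits the parsed PDF text into [code, description] rows.
function extractEcoRows(string $text): array
{
    // Updated pattern from the answer; the /s modifier is an assumption.
    $re = '/((?:ECO-(?!DE)[^\s]+)(?: (?:RX|B|N|M|UP|UP B|UP N|UP M))?)\s+(.*?)(?=(?:ECO-(?!DE))|\z)/s';
    preg_match_all($re, $text, $matches, PREG_SET_ORDER);

    $rows = [];
    foreach ($matches as $m) {
        // $m[1] is the ECO code, $m[2] the description (trailing space removed).
        $rows[] = [$m[1], trim($m[2])];
    }
    return $rows;
}
```

Each returned row can then be passed to fputcsv() to write one CSV line.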

php regular expression parse data

I have a field that contains 20 characters (the string is right-padded with spaces), like below:
VINEYARD HAVEN MA
BOLIVAR TN
,
BOLIVAR, TN
NORTH TONAWANDA, NY
How can I use a regular expression to parse this and get the data? The result I want looks like this:
[1] VINEYARD HAVEN [2] MA
[1] BOLIVAR [2] TN
[1] , or empty [2] , or empty
[1] BOLIVAR, or BOLIVAR [2] TN or ,TN
[1] NORTH TONAWANDA, or NORTH TONAWANDA [2] NY or ,NY
Currently I use this regex:
^(\D*)(?=[ ]\w{2}[ ]*)([ ]\w{2}[ ]*)
But it could not match the line:
,
Please help me adjust my regex so that it matches all the data above.
What about this regex: ^(.*)[ ,](\w*)$ ? You can see it working here: http://regexr.com/3cno7.
Example usage:
<?php
$string = 'VINEYARD HAVEN MA
BOLIVAR TN
,
BOLIVAR, TN
NORTH TONAWANDA, NY';
$lines = array_map('trim', explode("\n", $string));
$pattern = '/^(.*)[ ,](\w*)$/';
foreach ($lines as $line) {
$res = preg_match($pattern, $line, $matched);
print 'first: "' . $matched[1] . '", second: "' . $matched[2] . '"' . PHP_EOL;
}
It's probably possible to implement this in a regular expression (try /(.*)\b([A-Z][A-Z])$/ ); however, if you don't know how to write the regular expression you'll never be able to debug it. Yes, it's worth working out as a learning exercise, but since we're talking about PHP here (which does have a mechanism for storing compiled REs and isn't often used for bulk data operations), I would use something like the following if I needed to solve the problem quickly and in maintainable code:
$str=trim($str);
if (preg_match("/\b[A-Z][A-Z]$/i", $str, $match)) {
$state=$match[0];
$town=trim(substr($str,0,-2), " ,\t\n\r\0\x0B");
}
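The same idea can be sketched as a function that also handles fields with no state code, such as the lone ",". The name splitTownState and the choice to return empty strings for missing parts are assumptions made for illustration.

```php
<?php
// Hypothetical helper: splits a padded "TOWN[,] ST" field into [town, state].
function splitTownState(string $field): array
{
    $strip = " ,\t\n\r\0\x0B"; // whitespace plus the comma separator
    $field = trim($field);

    // A two-letter state code at the end, preceded by a word boundary.
    if (preg_match('/\b[A-Z]{2}$/i', $field, $match)) {
        $town = trim(substr($field, 0, -2), $strip);
        return [$town, strtoupper($match[0])];
    }

    // Assumption: without a state code, everything left is the town.
    return [trim($field, $strip), ''];
}
```

For example, splitTownState('BOLIVAR, TN') yields the town BOLIVAR and the state TN, while a field containing only a comma yields two empty strings.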

Regex Pattern to fetch the text between the tags

my string is:
$p['message'] = '[name]Fozia Faizan[/name]\n[cell]03334567897[/cell]\n[city]Karachi, Pakistan[/city]';
What I want to do is to use REGEX pattern so as to get the result like this:
Name: Fozia Faizan
Cell #: 03334567897
City: Karachi, Pakistan
I've tried this regex:
$regex = "/\\[(.*?)\\](.*?)\\[\\/\\1\\]/";
$message = preg_match_all($regex, $p['message'], $matches);
but it didn't work at all. Please help
Well, using the great reply from @jh314, you could write:
$p['message'] = '[name]Fozia Faizan[/name]\n[cell]03334567897[/cell]\n[city]Karachi, Pakistan[/city]';
$m = array();
preg_match_all('|\[(.*?)](.*?)\[/\1]|', $p['message'], $m);
$result = @array_combine($m[1], $m[2]);
$out = "Name: {$result['name']}\nCell #: {$result['cell']}\nCity: {$result['city']}";
echo $out;
//$outHTML = nl2br("Name: {$result['name']}\nCell #: {$result['cell']}\nCity: {$result['city']}");
//echo $outHTML;
That will give you:
Name: Fozia Faizan
Cell #: 03334567897
City: Karachi, Pakistan
EDIT: You could also add @ just before the function name, like so: @array_combine, to suppress the error at the top of your page, but only once this works and you get the expected results.
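Instead of suppressing the error, one option is to check whether anything matched before combining the arrays. A minimal sketch, assuming tag names consist only of word characters; parseTags is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns ['tag' => 'content', ...] for [tag]content[/tag] pairs.
function parseTags(string $message): array
{
    // \1 refers back to the tag name so that opening and closing tags agree.
    if (!preg_match_all('~\[(\w+)](.*?)\[/\1]~s', $message, $m)) {
        return []; // nothing matched (or the pattern failed)
    }
    return array_combine($m[1], $m[2]);
}
```

A caller can then test isset($result['name']) before building the output string, so a missing tag does not trigger a notice.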
Your regex already works, just combine the result in $matches:
$p['message'] = '[name]Fozia Faizan[/name]\n[cell]03334567897[/cell]\n[city]Karachi, Pakistan[/city]';
$regex = "/\\[(.*?)\\](.*?)\\[\\/\\1\\]/";
preg_match_all('~\[(.*?)](.*?)\[/\1]~', $p['message'], $matches);
$result = array_combine ($matches[1], $matches[2]);
print_r($result);
will give you:
Array
(
[name] => Fozia Faizan
[cell] => 03334567897
[city] => Karachi, Pakistan
)

preg_match_all for table having case sensitive code

I was trying to extract railway tickets data for internal use.
Total data looks like this table.
I have extracted the content of every <td> with preg_match_all, but I cannot extract the coach position, as seen in this screenshot
I have tried code like the following:
<?php
$result='tables code over here which you can find in pastebin link';
preg_match_all('/<TD class="table_border_both"><b>(.*)<\/b><\/TD>/s',$result,$matches);
var_dump($matches);
?>
I get rubbish output like:
You can use the following regular expression:
$re = "/<TD class=\"table_border_both\"><b>([0-9][0-9])\n<\/b><\/TD>/";
$str = "<table width=\"100%\" border=\"0\" cellpadding=\"0\" cellspacing=\"1\" class=\"table_border\">\n\n<tr>\n<td colspan=\"9\" class=\"heading_table_top\">Journey Details</td>\n</tr>\n<TR class=\"heading_table\">\n<td width=\"11%\">Train Number</Td>\n<td width=\"16%\">Train Name</td>\n<td width=\"18%\">Boarding Date <br>(DD-MM-YYYY)</td>\n<td width=\"7%\">From</Td>\n<td width=\"7%\">To</Td>\n<td width=\"14%\">Reserved Upto</Td>\n<td width=\"21%\">Boarding Point</Td>\n<td width=\"6%\">Class</Td>\n</TR>\n<TR>\n<TD class=\"table_border_both\">*12559</TD>\n<TD class=\"table_border_both\">SHIV GANGA EXP </TD>\n<TD class=\"table_border_both\"> 5- 7-2014</TD>\n<TD class=\"table_border_both\">BSB </TD>\n<TD class=\"table_border_both\">NDLS</TD>\n<TD class=\"table_border_both\">NDLS</TD>\n<TD class=\"table_border_both\">BSB </TD>\n<TD class=\"table_border_both\"> SL</TD>\n</TR>\n</table>\n<TABLE width=\"100%\" border=\"0\" cellpadding=\"0\" cellspacing=\"1\" class=\"table_border\" id=\"center_table\" >\n\n<TR>\n<td width=\"25%\" class=\"heading_table_top\">S. 
No.</td>\n<td width=\"45%\" class=\"heading_table_top\">Booking Status <br /> (Coach No , Berth No., Quota)</td>\n<td width=\"30%\" class=\"heading_table_top\">* Current Status <br />(Coach No , Berth No.)</td>\n<td width=\"30%\" class=\"heading_table_top\">Coach Position</td>\n</TR>\n<TR>\n<TD class=\"table_border_both\"><B>Passenger 1</B></TD>\n<TD class=\"table_border_both\"><B>S1 , 33,CK </B></TD>\n<TD class=\"table_border_both\"><B>S1 , 33</B></TD>\n<TD class=\"table_border_both\"><b>11\n</b></TD>\n</TR>\n<TR>\n<TD class=\"table_border_both\"><B>Passenger 2</B></TD>\n<TD class=\"table_border_both\"><B>S1 , 34,CK </B></TD>\n<TD class=\"table_border_both\"><B>S1 , 34</B></TD>\n<TD class=\"table_border_both\"><b>11\n</b></TD>\n</TR>\n<TR>\n<TD class=\"table_border_both\"><B>Passenger 3</B></TD>\n<TD class=\"table_border_both\"><B>S1 , 36,CK </B></TD>\n<TD class=\"table_border_both\"><B>S1 , 36</B></TD>\n<TD class=\"table_border_both\"><b>11\n</b></TD>\n</TR>\n<TR>\n<TD class=\"table_border_both\"><B>Passenger 4</B></TD>\n<TD class=\"table_border_both\"><B>S1 , 37,CK </B></TD>\n<TD class=\"table_border_both\"><B>S1 , 37</B></TD>\n<TD class=\"table_border_both\"><b>11\n</b></TD>\n</TR>\n<TR>\n<td class=\"heading_table_top\">Charting Status</td>\n<TD colspan=\"3\" align=\"middle\" valign=\"middle\" class=\"table_border_both\"> CHART PREPARED </TD>\n</TR>\n<TR>\n<td colspan=\"4\"><font color=\"#1219e8\" size=\"1\"><b> * Please Note that in case the Final Charts have not been prepared, the Current Status might upgrade/downgrade at a later stage.</font></b></Td>\n</TR>\n</table>";
preg_match_all($re, $str, $matches);
Most useful website for regex: http://regex101.com/
$regexp = '/<td class="table_border_both"><b>(.*)\s*<\/b><\/td>/i';
You have a line break in the "Coach Position" <td>, and your regexp does not account for it.
It is better to use \s*, so that the match won't fail if there are spaces or line breaks there.
You know that there are 4 columns, so the captured cells need one further transformation:
$data = array_chunk($matches[1], 4); // split up the captured cells by rows
Now the rows are ready. A few more lines and you have more than you need:
$data = array_map(function (array $row) {
return array_combine(['snum', 'status_book', 'status_cur', 'position'], $row);
}, $data); // assign each column in the row its name
If we combine all the code, it will probably look like this:
$data = array_map(function (array $row) {
return array_combine(['snum', 'status_book', 'status_cur', 'position'], $row);
}, array_chunk($matches[1], 4));
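Put together, the steps above might look like the following self-contained sketch. The function name parsePassengerRows is hypothetical, and the lazy (.*?) plus the check for complete rows are small defensive choices made here rather than part of the original answer.

```php
<?php
// Hypothetical helper: turns the bold table cells into one array per passenger.
function parsePassengerRows(string $html): array
{
    // Case-insensitive so that both <B> and <b> cells match; \s* absorbs the
    // line break that precedes </b> in the "Coach Position" cells.
    $pattern = '/<td class="table_border_both"><b>(.*?)\s*<\/b><\/td>/i';
    preg_match_all($pattern, $html, $matches);

    $keys = ['snum', 'status_book', 'status_cur', 'position'];
    $rows = [];
    foreach (array_chunk($matches[1], count($keys)) as $cells) {
        // Skip an incomplete trailing chunk so array_combine() cannot fail.
        if (count($cells) === count($keys)) {
            $rows[] = array_combine($keys, array_map('trim', $cells));
        }
    }
    return $rows;
}
```

Each element of the result then exposes the coach position as $row['position'].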
You need \s+ because there is whitespace (a line break) before </b> in those cells; otherwise they won't match:
$data = file_get_contents("http://pastebin.com/raw.php?i=zJrvq95H");
preg_match_all("#<b>([0-9]{0,})\s+<\/b>#", $data, $matches);
print_r($matches[1]);
Result:
Array
(
[0] => 11
[1] => 11
[2] => 11
[3] => 11
)

Convert Single Line Comments To Block Comments

I need to convert single line comments (//...) to block comments (/*...*/). I have nearly accomplished this in the following code; however, I need the function to skip any single line comment that is already inside a block comment. Currently it matches any single line comment, even when the single line comment is inside a block comment.
## Convert Single Line Comment to Block Comments
function singleLineComments( &$output ) {
    // A closure replaces create_function(), which was removed in PHP 8.
    $output = preg_replace_callback('#//(.*)#m',
        function ($match) {
            return "/* " . trim($match[1]) . " */";
        }, $output
    );
}
As already mentioned, "//..." can occur inside block comments and string literals. So if you create a small "parser" with the aid of a bit of regex trickery, you could first match either of those things (string literals or block comments) and only after that test whether "//..." is present.
Here's a small demo:
$code ='A
B
// okay!
/*
C
D
// ignore me E F G
H
*/
I
// yes!
K
L = "foo // bar // string";
done // one more!';
$regex = '@
("(?:\\\\.|[^\r\n\\\\"])*+") # group 1: matches double quoted string literals
|
(/\*[\s\S]*?\*/) # group 2: matches multi-line comment blocks
|
(//[^\r\n]*+) # group 3: matches single line comments
@x';
preg_match_all($regex, $code, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
foreach($matches as $m) {
if(isset($m[3])) {
echo "replace the string '{$m[3][0]}' starting at offset: {$m[3][1]}\n";
}
}
Which produces the following output:
replace the string '// okay!' starting at offset: 6
replace the string '// yes!' starting at offset: 56
replace the string '// one more!' starting at offset: 102
Of course, there are more string literals possible in PHP, but you get my drift, I presume.
HTH.
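To go from reporting offsets to actually rewriting the code, the same three-way alternation can be passed to preg_replace_callback. This is a sketch only: blockifyLineComments is a hypothetical name, and as in the demo only double-quoted string literals are recognised.

```php
<?php
// Hypothetical helper: rewrites "// ..." comments as "/* ... */", leaving
// double-quoted strings and existing block comments untouched.
function blockifyLineComments(string $code): string
{
    $regex = '~
        ("(?:\\\\.|[^\r\n\\\\"])*+")  # group 1: double-quoted string literal
      | (/\*[\s\S]*?\*/)              # group 2: block comment
      | (//[^\r\n]*+)                 # group 3: single-line comment
    ~x';

    return preg_replace_callback($regex, function (array $m): string {
        // Group 3 is only present when a single-line comment matched.
        if (isset($m[3]) && $m[3] !== '') {
            return '/* ' . trim(substr($m[3], 2)) . ' */';
        }
        return $m[0]; // strings and block comments are copied unchanged
    }, $code);
}
```

Because strings and block comments are consumed first, a "//" inside them never reaches the third alternative.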
You could try a negative look behind: http://www.regular-expressions.info/lookaround.html
## Convert Single Line Comment to Block Comments
function sinlgeLineComments( &$output ) {
    $output = preg_replace_callback('#^((?:(?!/\*).)*?)//(.*)#m',
        function ($match) {
            // $match[1] is the code before the comment, $match[2] the comment text
            return $match[1] . "/* " . trim($match[2]) . " */";
        }, $output
    );
}
However, I worry about strings with // in them, such as:
$x = "some string // with slashes";
This would get converted.
If your source file is PHP, you could use the tokenizer to parse the file with better precision.
http://php.net/manual/en/tokenizer.examples.php
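A minimal sketch of the tokenizer approach, assuming the input is a complete PHP source string (including its opening tag) and that the tokenizer extension is available. The function name convertLineComments is hypothetical.

```php
<?php
// Hypothetical helper: converts "// ..." comments to block comments using
// PHP's own tokenizer, so strings and existing block comments are never touched.
function convertLineComments(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (is_array($token) && $token[0] === T_COMMENT
            && strncmp($token[1], '//', 2) === 0) {
            // Before PHP 8 the trailing newline belongs to the comment token.
            $body  = rtrim($token[1], "\r\n");
            $trail = substr($token[1], strlen($body));
            // Avoid closing the new block comment early.
            $text  = str_replace('*/', '* /', trim(substr($body, 2)));
            $out  .= '/* ' . $text . ' */' . $trail;
        } else {
            // Array tokens carry their text at index 1; others are plain strings.
            $out .= is_array($token) ? $token[1] : $token;
        }
    }
    return $out;
}
```

Comments starting with # are also reported as T_COMMENT, which is why the sketch checks for the leading // explicitly.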
Edit:
Forgot about the fixed length, which you can overcome by nesting the expression. The above should work now. I tested it with:
$foo = "// this is foo";
sinlgeLineComments($foo);
echo $foo . "\n";
$foo2 = "/* something // this is foo2 */";
sinlgeLineComments($foo2);
echo $foo2 . "\n";
$foo3 = "the quick brown fox";
sinlgeLineComments($foo3);
echo $foo3 . "\n";
