preg_match_all for table having case sensitive code - php

I was trying to extract railway ticket data for internal use.
The full data looks like this table.
I have extracted the content of every <td> with preg_match_all, but I cannot extract the coach position, as seen in this screenshot.
I have tried code like the following:
<?php
$result='tables code over here which you can find in pastebin link';
preg_match_all('/<TD class="table_border_both"><b>(.*)<\/b><\/TD>/s',$result,$matches);
var_dump($matches);
?>
I get rubbish output like:

You can use the following regular expression:
$re = "/<TD class=\"table_border_both\"><b>([0-9][0-9])\n<\/b><\/TD>/";
$str = "<table width=\"100%\" border=\"0\" cellpadding=\"0\" cellspacing=\"1\" class=\"table_border\">\n\n<tr>\n<td colspan=\"9\" class=\"heading_table_top\">Journey Details</td>\n</tr>\n<TR class=\"heading_table\">\n<td width=\"11%\">Train Number</Td>\n<td width=\"16%\">Train Name</td>\n<td width=\"18%\">Boarding Date <br>(DD-MM-YYYY)</td>\n<td width=\"7%\">From</Td>\n<td width=\"7%\">To</Td>\n<td width=\"14%\">Reserved Upto</Td>\n<td width=\"21%\">Boarding Point</Td>\n<td width=\"6%\">Class</Td>\n</TR>\n<TR>\n<TD class=\"table_border_both\">*12559</TD>\n<TD class=\"table_border_both\">SHIV GANGA EXP </TD>\n<TD class=\"table_border_both\"> 5- 7-2014</TD>\n<TD class=\"table_border_both\">BSB </TD>\n<TD class=\"table_border_both\">NDLS</TD>\n<TD class=\"table_border_both\">NDLS</TD>\n<TD class=\"table_border_both\">BSB </TD>\n<TD class=\"table_border_both\"> SL</TD>\n</TR>\n</table>\n<TABLE width=\"100%\" border=\"0\" cellpadding=\"0\" cellspacing=\"1\" class=\"table_border\" id=\"center_table\" >\n\n<TR>\n<td width=\"25%\" class=\"heading_table_top\">S. 
No.</td>\n<td width=\"45%\" class=\"heading_table_top\">Booking Status <br /> (Coach No , Berth No., Quota)</td>\n<td width=\"30%\" class=\"heading_table_top\">* Current Status <br />(Coach No , Berth No.)</td>\n<td width=\"30%\" class=\"heading_table_top\">Coach Position</td>\n</TR>\n<TR>\n<TD class=\"table_border_both\"><B>Passenger 1</B></TD>\n<TD class=\"table_border_both\"><B>S1 , 33,CK </B></TD>\n<TD class=\"table_border_both\"><B>S1 , 33</B></TD>\n<TD class=\"table_border_both\"><b>11\n</b></TD>\n</TR>\n<TR>\n<TD class=\"table_border_both\"><B>Passenger 2</B></TD>\n<TD class=\"table_border_both\"><B>S1 , 34,CK </B></TD>\n<TD class=\"table_border_both\"><B>S1 , 34</B></TD>\n<TD class=\"table_border_both\"><b>11\n</b></TD>\n</TR>\n<TR>\n<TD class=\"table_border_both\"><B>Passenger 3</B></TD>\n<TD class=\"table_border_both\"><B>S1 , 36,CK </B></TD>\n<TD class=\"table_border_both\"><B>S1 , 36</B></TD>\n<TD class=\"table_border_both\"><b>11\n</b></TD>\n</TR>\n<TR>\n<TD class=\"table_border_both\"><B>Passenger 4</B></TD>\n<TD class=\"table_border_both\"><B>S1 , 37,CK </B></TD>\n<TD class=\"table_border_both\"><B>S1 , 37</B></TD>\n<TD class=\"table_border_both\"><b>11\n</b></TD>\n</TR>\n<TR>\n<td class=\"heading_table_top\">Charting Status</td>\n<TD colspan=\"3\" align=\"middle\" valign=\"middle\" class=\"table_border_both\"> CHART PREPARED </TD>\n</TR>\n<TR>\n<td colspan=\"4\"><font color=\"#1219e8\" size=\"1\"><b> * Please Note that in case the Final Charts have not been prepared, the Current Status might upgrade/downgrade at a later stage.</font></b></Td>\n</TR>\n</table>";
preg_match_all($re, $str, $matches);
Most useful website for regex: http://regex101.com/

$regexp = '/<td class="table_border_both"><b>(.*?)\s*<\/b><\/td>/i';
There is a line break inside the "Coach Position" <td>, and your regexp does not account for it.
It is better to use \s* so that the match does not fail when there are spaces or line breaks before the closing tag. Note that PHP has no g modifier; preg_match_all already returns every match.
You know that the table has 4 columns, so the captured values in $matches[1] can be transformed further:
$data = array_chunk($matches[1], 4); // split the matches up into rows
Now the rows are ready. A few more lines give you named columns:
$data = array_map(function (array $row) {
    return array_combine(['snum', 'status_book', 'status_cur', 'position'], $row);
}, $data); // assign each column in the row its name
If we combine all the code, it will look something like this:
$data = array_map(function (array $row) {
    return array_combine(['snum', 'status_book', 'status_cur', 'position'], $row);
}, array_chunk($matches[1], 4));
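Put together, the steps above might look like the following sketch. It assumes a shortened, two-passenger copy of the table ($html is only a stand-in for the real page), uses the i modifier without g, which PHP does not have, and reads the captured values from $matches[1].

```php
<?php
// Shortened stand-in for the status table (assumption: two passengers only).
$html = <<<'HTML'
<TR>
<TD class="table_border_both"><B>Passenger 1</B></TD>
<TD class="table_border_both"><B>S1 , 33,CK </B></TD>
<TD class="table_border_both"><B>S1 , 33</B></TD>
<TD class="table_border_both"><b>11
</b></TD>
</TR>
<TR>
<TD class="table_border_both"><B>Passenger 2</B></TD>
<TD class="table_border_both"><B>S1 , 34,CK </B></TD>
<TD class="table_border_both"><B>S1 , 34</B></TD>
<TD class="table_border_both"><b>11
</b></TD>
</TR>
HTML;

// Case-insensitive match; \s* absorbs the line break before </b>.
preg_match_all('/<td class="table_border_both"><b>(.*?)\s*<\/b><\/td>/i', $html, $matches);

// Group the flat list of cell values into rows of 4 and name the columns.
$rows = array_map(
    function (array $row) {
        return array_combine(['snum', 'status_book', 'status_cur', 'position'], $row);
    },
    array_chunk($matches[1], 4)
);

print_r($rows);
```

Each element of $rows is then an associative array, so the coach position of the first passenger is $rows[0]['position'].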

The \s+ is needed because there is whitespace (a line break) before the closing </b>; without it the pattern will not match.
$data = file_get_contents("http://pastebin.com/raw.php?i=zJrvq95H");
preg_match_all("#<b>([0-9]+)\s+</b>#", $data, $matches);
print_r($matches[1]);
Result:
Array
(
[0] => 11
[1] => 11
[2] => 11
[3] => 11
)
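The reason this short pattern returns only the coach positions is that, in the sample table, those cells are the only ones written with a lowercase <b>. A small sketch of the difference, using a two-cell excerpt as an assumed input:

```php
<?php
// Two cells from the table: one with <B> (uppercase), one with <b> (lowercase).
$html = "<TD class=\"table_border_both\"><B>S1 , 33</B></TD>\n"
      . "<TD class=\"table_border_both\"><b>11\n</b></TD>\n";

// Case-sensitive: only the lowercase <b> cell matches.
preg_match_all('#<b>([0-9]+)\s+</b>#', $html, $sensitive);

// Case-insensitive: both cells match.
preg_match_all('#<b>(.*?)\s*</b>#i', $html, $insensitive);

print_r($sensitive[1]);   // only the coach position
print_r($insensitive[1]); // berth status and coach position
```

If the page ever changes the letter case of its tags, the case-sensitive version will silently stop matching, so treat it as a convenience for this particular markup.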

Related

Parsing PDF tables into csv with php

I need to convert a PDF file containing tables into CSV, so I used PdfParser to extract the entire text and then preg_match_all to search for the patterns of each table, so that I can create an array from each table in the PDF.
The structure of the PDF is as follows:
When I parse it, I get this:
ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas
I figured out how to match all the ECO-XXXXX codes with preg_match_all, but I don't know how to match all the descriptions.
This is what works for the ECO-XXXXXX codes:
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('publication.pdf');
$text = $pdf->getText();
echo $text;
$pattern = '/ECO-[.-^*-]{3,}| ECO-[.-^*-]{4,}\s\b[NMB]\b|ECO-[.-^*-]{4,}\sUP| ECO-[.-^*-]{3,}\sUP\s[B-N-M]{1}| ECO-[.-^*-]{3,}\sRX/' ;
preg_match_all($pattern, $text, $array);
echo "<hr>";
print_r($array);
I get
Array ( [0] => Array ( [0] => ECO-698 [1] => ECO-CHI-522 [2]
You may try this regex:
(ECO[^\s]+)\s+(.*?)(?=ECO|\z)
With this input string, group 1 contains the ECO block and group 2 contains the description.
Explanation:
(ECO[^\s]+) captures the full ECO block, up to the first whitespace character.
\s+ matches one or more whitespace characters.
(.*?)(?=ECO|\z) Here (.*?) matches the description, and (?=ECO|\z) is a positive lookahead that matches ECO or the end of the string (\z).
Source code:
$re = '/(ECO[^\s]+)\s+(.*?)(?=ECO|\z)/m';
$str = 'ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
$val=1;
foreach ($matches as $value)
{
echo "\n\nRow no:".$val++;
echo "\ncol 1:".$value[1]."\ncol 2:".$value[2];
}
Update, as per the comment:
((?:ECO-(?!DE)[^\s]+)(?: (?:UP B|UP N|UP M|UP|RX|B|N|M))?)\s+(.*?)(?=(?:ECO-(?!DE))|\z)
The longer suffixes (UP B, UP N, UP M) are listed before UP because PCRE tries the alternatives from left to right.
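A minimal sketch of how a pattern with optional suffixes could be run. The input line is an assumption (the "UP B" suffix on the first code is invented for illustration), and the longer suffixes are listed first because PCRE takes the first alternative that lets the overall match succeed.

```php
<?php
// Hypothetical input: the "UP B" suffix is invented for this example.
$text = 'ECO-698 UP B Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas';

// Group 1: code plus optional suffix. Group 2: description up to the next code.
$re = '/((?:ECO-(?!DE)[^\s]+)(?: (?:UP B|UP N|UP M|UP|RX|B|N|M))?)\s+(.*?)(?=ECO-(?!DE)|\z)/';

preg_match_all($re, $text, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    // The description keeps the space before the next code, so trim it.
    $rows[] = [$m[1], trim($m[2])];
}

print_r($rows);
```

From here each element of $rows can be written out with fputcsv() to produce the CSV file.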

Parsing a nested sentence in PHP

I am very new to PHP and am trying to parse a line from a database and extract some necessary information from it.
EDIT:
I have to extract the authors' first names and surnames. For the example line below,
the expected output is:
Ayse Serap Karadag
Serap Gunes Bilgili
Omer Calka
Sevda Onder
Evren Burakgazi-Dalkilic
LINE
[Karadag, Ayse Serap; Bilgili, Serap Gunes; Calka, Omer; Onder, Sevda] Yuzuncu Yil Univ, Sch Med, Dept Dermatol. %#[Burakgazi-Dalkilic, Evren] UMDNJ Cooper Univ Med Ctr, Piscataway, NJ USA.1
I read this line from the database. It contains author names that I have to extract.
The author names are written inside []. The surname comes first, separated from the given names by a comma; if there are further authors, they are separated by semicolons.
I have to do this in a loop, because I have nearly 1000 lines like this.
My code is:
<?php
$con=mysqli_connect("localhost","root","","authors");
if (mysqli_connect_errno())
{
echo "Failed to connect to MySQL: " . mysqli_connect_error();
}
$result = mysqli_query($con,"SELECT Correspounding_Author FROM paper Limit 10 ");
while($row = mysqli_fetch_array($result))
{
echo "<br>";
echo $row['Correspounding_Author'] ;
echo "<br>";
// do sth here
}
mysqli_close($con);
?>
I have looked at functions such as explode() and substr(), but as I mentioned at the beginning, I cannot handle this nested sentence.
Any help is appreciated.
The code inside your while loop should be:
preg_match_all("/\\[([^\\]]+)\\]/", $row['Correspounding_Author'], $matches);
foreach($matches[1] as $match){
$exp = explode(";", $match);
foreach($exp as $val){
print(implode(" ", array_map("trim", array_reverse(explode(",", $val))))."<br/>");
}
}
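The same idea can be wrapped in a function so it is easy to try outside the database loop. This is only a sketch: the name extractAuthors is made up for illustration, and it returns the names as an array instead of printing them.

```php
<?php
// Hypothetical helper: returns "Given Surname" for every author in [...] groups.
function extractAuthors(string $line): array
{
    $names = [];
    // Capture the content of every [...] group.
    preg_match_all('/\[([^\]]+)\]/', $line, $groups);
    foreach ($groups[1] as $group) {
        // Authors inside one group are separated by semicolons.
        foreach (explode(';', $group) as $author) {
            // "Surname, Given names" becomes "Given names Surname".
            $parts = array_map('trim', explode(',', $author, 2));
            $names[] = implode(' ', array_reverse($parts));
        }
    }
    return $names;
}

$line = '[Karadag, Ayse Serap; Bilgili, Serap Gunes; Calka, Omer; Onder, Sevda] Yuzuncu Yil Univ, Sch Med, Dept Dermatol. %#[Burakgazi-Dalkilic, Evren] UMDNJ Cooper Univ Med Ctr, Piscataway, NJ USA.1';
$authors = extractAuthors($line);
print_r($authors);
```

Inside the while loop, the call would be extractAuthors($row['Correspounding_Author']).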
The following should work:
$pattern = '~(?<=\[|\G;)([^,]+),([^;\]]+)~';
if (preg_match_all($pattern, $row['Correspounding_Author'], $matches, PREG_SET_ORDER)) {
print_r(array_map(function($match) {
return sprintf('%s %s', ltrim($match[2]), ltrim($match[1]));
}, $matches));
}
It's a single expression that matches items that:
Start with opening square bracket [ or continue where the last match ended followed by a semicolon,
End just before either a semicolon or closing square bracket.
See also: PCRE Assertions.
Output
Array
(
[0] => Ayse Serap Karadag
[1] => Serap Gunes Bilgili
[2] => Omer Calka
[3] => Sevda Onder
[4] => Evren Burakgazi-Dalkilic
)

How would one use PHP preg_match_all to differentiate anchor elements identified by attribute of inner HTML element?

I have sets of HTML anchor elements enclosing image elements. For each set, using PHP-CLI, I want to pull the URLs and classify them according to their types. The type of anchor can only be determined by an attribute of its child image element. It would be easy if there was only one of each type per set. My problem is when two anchor elements of one type are separated by one or more of the other types. My non-greedy parenthesized sub-pattern seems to become greedy and expands to find the second relevant child attribute. In my test script I'm trying to pull the 'Userlink' URLs from amongst the other types. Using a simple pattern like:
#<a href="(.*?)" custattr="value1"><img alt="Userlink"#
On a set like:
<li><img alt="Userlink" class="common_link_class" height="123" src="pic0.png" width="123" style="width: 123px;"></li><li><img alt="Socnet1" class="common_link_class" height="123" src="pic1.png" width="123" style="width: 123px;"></li><li><img alt="Socnet2" class="common_link_class" height="123" src="pic2.png" width="123" style="width: 123px;"></li><li><img alt="Usermail" class="common_link_class" height="123" src="pic3.png" width="123" style="width: 123px;"></li><li><img alt="Userlink" class="common_link_class" height="123" src="pic4.png" width="123" style="width: 123px;"></li>
(sorry, but the actual html is on one line like that)
My sub-pattern captures from the beginning of the first "Userlink" URL to the end of the last one.
I've tried many variations of look-aheads, not sure I should list them all here. So far they've either returned no match at all or the same as described above.
Here's my test script (running in a Bash shell):
#!/usr/bin/php
<?
$lines = 0;
$input = "";
$matches = array();
while ($line = fgets(STDIN)){
$input .= $line;
$lines++;
}
fwrite(STDERR, "Processing $lines\n");
$pcre = '#<a href="(.*?)" custattr="value1"><img alt="Userlink"#';
if (preg_match_all($pcre,$input,$matches)){
fwrite(STDERR, "\$matches has " . count($matches) . " elements\n");
foreach ($matches[1] as $match){
fwrite(STDOUT, $match . "\n");
}
}
?>
What PCRE pattern for PHP's preg_match_all() would return the two "Userlink" URLs in the above example?
I have taken the liberty of changing your variable names:
$pattern = '~<a href="([^"]++)" custattr="value1"><img alt="Userlink"~';
if ($nb = preg_match_all($pattern, $input, $matches)) {
    fwrite(STDERR, "\$matches has " . $nb . " elements\n");
    fwrite(STDOUT, implode("\n", $matches[1]) . "\n");
}
Note that the preg_match_all function returns the number of matches.
This regex should work -
<a href="([^"]*?)"[^>]*\><img alt="Userlink"
Testing it -
$pcre = '/<a href="([^"]*?)"[^>]*\><img alt="Userlink"/';
if (preg_match_all($pcre,$input,$matches)){
var_dump($matches);
//$matches[1] will be the array containing the urls.
}
/*
OUTPUT-
array
0 =>
array
0 => string '<a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink"' (length=85)
1 => string '<a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink"' (length=85)
1 =>
array
0 => string 'http://www.userlink1.com/my/page.html' (length=37)
1 => string 'http://www.userlink2.com/my/page.html' (length=37)
*/
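Why the lazy (.*?) appears to turn greedy: the engine starts at the first <a href=" it finds and keeps extending the capture until the rest of the pattern fits, even if that means running across other anchors. A character class that cannot cross a quote avoids this. The sketch below uses invented URLs and a shortened two-item list as assumptions.

```php
<?php
// Hypothetical one-line input; the URLs are invented for illustration.
$input = '<li><a href="http://example.com/social" custattr="value1"><img alt="Socnet1"></a></li>'
       . '<li><a href="http://example.com/user" custattr="value1"><img alt="Userlink"></a></li>';

// Lazy dot: starts at the first anchor and expands until "Userlink" is reached.
preg_match_all('#<a href="(.*?)" custattr="value1"><img alt="Userlink"#', $input, $lazy);

// Negated class: the capture cannot run past the closing quote of the href.
preg_match_all('#<a href="([^"]+)" custattr="value1"><img alt="Userlink"#', $input, $bounded);

echo $lazy[1][0], "\n";    // spans both anchors
echo $bounded[1][0], "\n"; // http://example.com/user
```

Lazy quantifiers only prefer the shortest match from a given starting point; they do not move the starting point forward.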

php preg_match_all and preg_replace.

[caption id="attachment_1342" align="alignleft" width="300" caption="Cheers... "Forward" diversifying innovation to secure first place. "][/caption] A group of 35 students from...
I'm reading this data from an API. I want the text to start with "A group of 35 students from...". Please help me replace the caption tag with an empty string. This is what I tried:
echo "<table>";
echo "<td>".$obj[0]['title']."</td>";
echo "<td>".$obj[0]['content']."</td>";
echo "</table>";
$html = $obj[0]['content'];
preg_match_all('/<caption>(.*?)<\/caption>/s', $html, $matches);
preg_replace('',$matches, $obj[0]['content']);
Any help would be appreciated.
$pattern = "/\[caption (.*?)\](.*?)\[\/caption\]/i";
$removed = preg_replace($pattern, "", $html);
echo preg_replace("#\[caption.*\[/caption\]#u", "", $str);
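Applied to the bracketed [caption] text shown in the question, the replacement approach might look like the sketch below. The variable names are assumptions; the s modifier lets the dot match line breaks in case a caption spans several lines, and the lazy .*? stops at the first [/caption].

```php
<?php
// The content string as it appears in the question.
$content = '[caption id="attachment_1342" align="alignleft" width="300" caption="Cheers... "Forward" diversifying innovation to secure first place. "][/caption] A group of 35 students from...';

// Remove every [caption ...]...[/caption] block, then trim the leading space.
$clean = ltrim(preg_replace('/\[caption\b.*?\[\/caption\]/is', '', $content));

echo $clean; // A group of 35 students from...
```

preg_replace returns the new string rather than changing its input, so the result has to be assigned or echoed.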
In the snippet from the question, the regex search pattern is incorrect: there is no <caption> in the input, only <caption id=...>.
Second, the preg_replace call does not do anything useful here. preg_replace expects three arguments: first the regex pattern to search for, second the replacement string, and third the input string.
The following snippet using preg_match will work:
<?php
//The input string from API
$inputString = '<caption id="attachment_1342" align="alignleft" width="300" caption="Cheers... "Forward" diversifying innovation to secure first place. "></caption> A group of 35 students from';
//Search Regex
$pattern = '/<caption(.*?)<\/caption>(.*?)$/';
//preg_match searches inputString for a match to the regular expression given in pattern
//The matches are placed in the third argument.
preg_match($pattern, $inputString, $matches);
//$matches[0] is the whole match, $matches[1] is the content inside the caption tag, $matches[2] is the text after the closing tag.
echo $matches[2];
// var_dump($matches);
?>
If you still want to use preg_match_all for some reason, the following snippet is a modification of the one in the question:
<?php
//Sample Object for test
$obj = array(
array(
'title' => 'test',
'content' => '<caption id="attachment_1342" align="alignleft" width="300" caption="Cheers... "Forward" diversifying innovation to secure first place. "></caption> A group of 35 students from'
)
);
echo "<table border='1'>";
echo "<td>".$obj[0]['title']."</td>";
echo "<td>".$obj[0]['content']."</td>";
echo "</table>";
$html = $obj[0]['content'];
//preg_match_all will put the caption tag in first match
preg_match_all('/<caption(.*?)<\/caption>/s', $html, $matches2);
//var_dump($matches2);
//use replace to remove the chunk from content
$obj[0]['content'] = str_replace($matches2[0], '', $obj[0]['content']);
//var_dump($obj);
?>
Thank you, guys. I used the explode function to do this:
$html = $obj[0]['content'];
$code = explode("[/caption]", $html);
if (isset($code[1]) && $code[1] != '') {
    echo $code[1];
}

PHP Replace chars by functions

I am trying to replace tags in a string with the result of a function call.
<?php
$string = "Hi everybody people [gal~images/articles~100~100~4] here other imagen [gal~images/products~100~100~3]";
$regex = "/\[(.*?)\]/";
preg_match_all($regex, $string, $matches);
for($i=0; $i<count($matches[1]);$i++)
{
$match = $matches[1][$i];
$array = explode('~', $match);
//$newValuet="gal("".$array[1]."","".$array[2]."","".$array[3]."","".$array[4]."")";
$newValue="gal(".$array[1].",".$array[2].",".$array[3].",".$array[4].")";
$string = str_replace($matches[0][$i],$newValue,$string);
}
echo $string;
?>
The problem is here:
$newValue="gal(".$array[1].",".$array[2].",".$array[3].",".$array[4].")";
$string = str_replace($matches[0][$i],$newValue,$string);
This does not give the right result. I have tried different methods, but the problem remains. If you can answer, please show me a modification of this code so that I can understand it. Thanks a lot for all your help.
MORE INFORMATION
The script generates the new values, and these values must be passed to the gal function so that its output replaces the tags, leaving the text with the tags replaced:
$newValue="gal(".$array[1].",".$array[2].",".$array[3].",".$array[4].")";
$string = str_replace($matches[0][$i],$newValue,$string);
Here gal() shows the gallery as soon as it is called, but it should run inside str_replace so that its output is placed in the text where the tag was, and then processing should continue. This is the only part that fails.
P.S.: I respect the people here. I come here to ask my questions, although I often write badly because English is not my native language. I am sure everyone here would like to be respected even if they do not write well. Everyone here needs to learn or needs help, and tomorrow I may help other people, or they may help me. That is all. Thanks.
Regards
Gallery function:
<?php
function gal($dire,$w_size,$h_size,$cols)
{
$n_images=0;
$dir=opendir("$dire");
while($file=readdir($dir))
{
if ($file!="." && $file!=".." && $file!="Thumbs.db" && $file!="index.html" && $file!=".htaccess" && $file!="config.dat")
{
$n_images++;
$imagenes[]=$file;
}
}
closedir($dir);
$num_img="".$n_images."";
$num_img_cols="".$cols."";
$num_filas="".ceil($num_img/$num_img_cols)."";
echo '<table width="10%" border="0" cellpadding="3" cellspacing="2" align="center">
<tr>
<td colspan="'.$num_img_cols.'"> </td>
</tr><tr>';
$x=1;
for ($j=0;$j<$n_images;$j++)
{
print "<td align='center'>";
//echo '<img src="'.$dire.'/'.$imagenes[$j].'" width="'.$w_size.'" height="'.$h_size.'">';
print "<a href='".$dire."/".$imagenes[$j]."' class='highslide' onclick=\"return hs.expand(this,{slideshowGroup:'group1',align:'center'})\">";
print '<img src="indexer_resizer.php?ruta_f='.$dire.'/&image='.$imagenes[$j].'&new_width='.$w_size.'&new_height='.$h_size.'" border="0" alt="Pulse para Ampliar" title="Pulse para Ampliar">
</a>';
if ($x%$num_img_cols==0)
{
print "</tr>";
}
$x++;
}
echo "</table>";
}
?>
Function to replace the tags:
<?php
$string = "Hola todos os presento una nueva galeria [galt~imagenes/articulos~100~100~4] Aqui otra más [gal~imagenes/productos~100~100~3]";
$regex = "/\[(.*?)\]/";
preg_match_all($regex, $string, $matches);
for($i=0; $i<count($matches[1]);$i++)
{
$match = $matches[1][$i];
$array = explode('~', $match);
//$newValuet="gal("".$array[1]."","".$array[2]."","".$array[3]."","".$array[4]."")";
$newValue="gal(".$array[1].",".$array[2].",".$array[3].",".$array[4].")";
$string = str_replace($matches[0][$i],$newValue,$string);
}
echo $string;
?>
That is all of the code. As you can see, the script uses two things: the function that displays the gallery and the script that replaces the tags to show the gallery, but it does not work properly.
To capture those portions of the string, try:
$string = "Hi everybody people [gal~images/articles~100~100~4] here other imagen [gal~images/products~100~100~3]";
preg_match_all('/\[(.+)\]/ismU', $string, $matches);
or
preg_match_all('/\[(\w+)\~([a-z0-9\/]+)\~(\d+)\~(\d+)\~(\d+)\]/ismU', $string, $matches);
The last one produces a slightly more complicated $matches structure, but ensures that you capture exactly the data you need.
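Capturing the parts is only half of what the question asks for; the tag still has to be replaced by what the function prints. One possible way, which goes beyond the answer above, is preg_replace_callback combined with output buffering. In this sketch galStub is a made-up stand-in for gal() that echoes a placeholder instead of reading a directory.

```php
<?php
// Hypothetical stand-in for gal(): it echoes markup, as the original does,
// but does not touch the filesystem.
function galStub($dir, $width, $height, $cols)
{
    echo "<gallery dir=\"$dir\" w=\"$width\" h=\"$height\" cols=\"$cols\">";
}

$string = "Hi everybody people [gal~images/articles~100~100~4] here other imagen [gal~images/products~100~100~3]";

$result = preg_replace_callback(
    '/\[gal~([a-z0-9\/]+)~(\d+)~(\d+)~(\d+)\]/i',
    function (array $m) {
        ob_start();                 // capture what the function echoes
        galStub($m[1], (int) $m[2], (int) $m[3], (int) $m[4]);
        return ob_get_clean();      // use the captured output as the replacement
    },
    $string
);

echo $result;
```

Building the string "gal(...)" only produces text; the function has to be called for real, and because it echoes its markup the output buffer is what puts that markup in the right place in the text.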
