Parsing PDF tables into CSV with PHP

I need to convert a PDF file containing tables into CSV. I used PdfParser (the \Smalot\PdfParser library) to extract the entire text, and I want to use preg_match_all to find the pattern of each table so that I can build an array from each table in the PDF.
The structure of the following PDF is:
When I parse I get this
ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas
I figured out how to match all the ECO-XXXXX codes with preg_match_all, but I don't know how to match all the descriptions.
This is what works for the ECO-XXXXXX codes:
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('publication.pdf');
$text = $pdf->getText();
echo $text;
$pattern = '/ECO-[.-^*-]{3,}| ECO-[.-^*-]{4,}\s\b[NMB]\b|ECO-[.-^*-]{4,}\sUP| ECO-[.-^*-]{3,}\sUP\s[B-N-M]{1}| ECO-[.-^*-]{3,}\sRX/' ;
preg_match_all($pattern, $text, $array);
echo "<hr>";
print_r($array);
I get
Array ( [0] => Array ( [0] => ECO-698 [1] => ECO-CHI-522 [2]

You may try this regex:
(ECO[^\s]+)\s+(.*?)(?=ECO|\z)
As per the input string, group1 contains the ECO Block and group 2 contains the descriptions.
Explanation:
(ECO[^\s]+) captures the full ECO block until it reaches whitespace.
\s+ matches one or more whitespace characters.
(.*?)(?=ECO|\z) Here (.*?) lazily matches the description, and (?=ECO|\z) is a positive lookahead that stops the match at the next ECO or at the end of the string (\z).
Source code:
$re = '/(ECO[^\s]+)\s+(.*?)(?=ECO|\z)/m';
$str = 'ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
$val=1;
foreach ($matches as $value)
{
echo "\n\nRow no:".$val++;
echo "\ncol 1:".$value[1]."\ncol 2:".$value[2];
}
Update, as per the comment:
((?:ECO-(?!DE)[^\s]+)(?: (?:RX|B|N|M|UP|UP B|UP N|UP M))?)\s+(.*?)(?=(?:ECO-(?!DE))|\z)
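Neither snippet above writes the CSV itself, so here is a minimal sketch of that last step. It reuses the simpler pattern from this answer with the s modifier added, on the assumption that the real PDF text contains line breaks. The helper names extractRows() and rowsToCsv() are hypothetical and not part of PdfParser. Note that a description containing the letters ECO would be cut short by the lookahead.

```php
<?php
// Sample text, as returned by $pdf->getText() in the question.
$text = 'ECO-698 Acondicionador Frio-Calor ECO-CHI-522 Chimenea eléctrica con patas';

// Hypothetical helper: split the parsed text into [code, description] rows.
function extractRows(string $text): array
{
    // The "s" modifier lets "." cross line breaks in the extracted PDF text.
    $re = '/(ECO[^\s]+)\s+(.*?)(?=ECO|\z)/s';
    preg_match_all($re, $text, $matches, PREG_SET_ORDER);

    $rows = [];
    foreach ($matches as $m) {
        // The lazy group keeps the whitespace before the next code, so trim it.
        $rows[] = [$m[1], trim($m[2])];
    }
    return $rows;
}

// Hypothetical helper: serialize the rows as CSV text using fputcsv().
function rowsToCsv(array $rows): string
{
    $fh = fopen('php://memory', 'w+');
    foreach ($rows as $row) {
        fputcsv($fh, $row, ',', '"', '\\');
    }
    rewind($fh);
    $csv = stream_get_contents($fh);
    fclose($fh);
    return $csv;
}

echo rowsToCsv(extractRows($text));
```

To write a file instead of a string, the same loop can target a handle opened with fopen('tables.csv', 'w'), where the file name is only an example.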

Related

Extracting links from a piece of text in PHP except ignoring image links

I have this piece of text, and I want to extract links from it. Some links will be wrapped in tags and some will just appear in plain format. But the text also contains images, and I don't want their links.
How would I extract the links from this piece of text while ignoring image links? So basically, both kinds of link, including google.com, should be extracted.
string(441) "<p class="fr-tag">Please visit https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this link should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>"
I have tried the following, but it's incomplete:
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('a');
$hrefs = [];
foreach ($tags as $tag) {
$hrefs[] = $tag->getAttribute('href');
}
Using just that one string to test, the following works for me:
$str = '<p class="fr-tag">Please visit https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this link should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>';
preg_match('~a href="(.*?)"~', $str, $strArr);
Using a href="..." in the preg_match() call fills the array $strArr with two values: the full match and the captured URL, here the link to Google.
Array
(
[0] => a href="https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg"
[1] => https://www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg
)
I would try something like this.
Find and remove images tags:
$content = preg_replace("/<img[^>]+\>/i", "(image) ", $content);
Find and collect URLs.
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $content, $match);
Output Urls:
print_r($match);
Good luck!
I played around with this a lot more and have an answer that may better suit what you are trying to do with a bit of "future proofing"
$str = '<p class="fr-tag">Please visit www.google.co.uk/?gfe_rd=cr&ei=9P2DVaW2BMWo8wfK74HYCg and this link should be filtered and this http://d.pr/i/1i2Xu <img class="fr-fin fr-tag" alt="Image title" src="https://cft-forum.s3-us-west-2.amazonaws.com/uploads%2F1434714755338-Screen+Shot+2015-06-19+at+12.52.28.png" width="300"></p>';
$str = str_replace('&nbsp;', ' ', $str);
$strArr = explode(' ',$str);
$len = count($strArr);
for($i = 0; $i < $len; $i++){
if(stristr($strArr[$i],'http') || stristr($strArr[$i],"www")){
$matches[] = $strArr[$i];
}
}
echo "<pre>";
print_r($matches);
echo "</pre>";
I went back and analyzed your string and noticed that if you translate the &nbsp; entities to spaces, you can then explode the string into an array, step through it, and add any element that contains http or www to the $matches array to be processed later. The output is pretty clean and easy to work with, and you also get rid of most of the HTML markup this way.
Note that this probably isn't the best way to do it. I haven't tested it with any string other than the one you offered, so there is room for optimization.
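As an alternative to the string-based answers above, here is a rough sketch that builds on the DOMDocument attempt from the question. It collects href values from a tags and plain-text URLs from text nodes, and never reads img attributes. The function name extractLinks(), the sample markup and the URL pattern are assumptions for illustration. The pattern is deliberately loose and only strips common trailing punctuation.

```php
<?php
// Hypothetical helper: links from <a href> plus plain-text URLs, ignoring <img src>.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep warnings about imperfect markup quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $links = [];

    // 1. Links that are wrapped in <a> tags.
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }

    // 2. Plain-text URLs. Only text nodes are scanned, so attribute values
    //    such as the src of an <img> are never considered.
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
        if (preg_match_all('~\b(?:https?://|www\.)[^\s<>"\']+~i', $node->nodeValue, $m)) {
            foreach ($m[0] as $url) {
                $links[] = rtrim($url, '.,;:!?)');
            }
        }
    }

    return array_values(array_unique($links));
}

// Simplified sample input (an assumption, modeled on the string in the question).
$html = '<p>Please visit <a href="https://www.google.co.uk/">Google</a> and this '
      . 'http://d.pr/i/1i2Xu <img alt="Image title" src="https://example.com/shot.png" width="300"></p>';

print_r(extractLinks($html));
```

Working on text nodes rather than on the raw string is the design choice here: the parser has already separated markup from content, so no img tag has to be removed by hand first.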

Parsing a nested sentence in PHP

I am very new to PHP and am trying to parse a line from a database and extract some necessary information from it.
EDIT:
I have to extract the authors' first names and surnames. For the first example line,
the expected output should be:
Ayse Serap Karadag
Serap Gunes Bilgili
Omer Calka
Sevda Onder
Evren Burakgazi-Dalkilic
LINE
[Karadag, Ayse Serap; Bilgili, Serap Gunes; Calka, Omer; Onder, Sevda] Yuzuncu Yil Univ, Sch Med, Dept Dermatol. %#[Burakgazi-Dalkilic, Evren] UMDNJ Cooper Univ Med Ctr, Piscataway, NJ USA.1
I take this line from the database. It contains some author names which I have to extract.
The author names are written inside []. Each surname comes first, separated from the given names by a comma; if there is another author, the entries are separated by a semicolon.
I have to do this in a loop because I have nearly 1000 lines like this.
My code is:
<?php
$con=mysqli_connect("localhost","root","","authors");
if (mysqli_connect_errno())
{
echo "Failed to connect to MySQL: " . mysqli_connect_error();
}
$result = mysqli_query($con,"SELECT Correspounding_Author FROM paper Limit 10 ");
while($row = mysqli_fetch_array($result))
{
echo "<br>";
echo $row['Correspounding_Author'] ;
echo "<br>";
// do sth here
}
mysqli_close($con);
?>
I have been looking at functions like explode() and substr(), but as I mentioned at the beginning, I cannot handle this nested sentence.
Any help is appreciated.
The code inside your while loop should be:
preg_match_all("/\\[([^\\]]+)\\]/", $row['Correspounding_Author'], $matches);
foreach($matches[1] as $match){
$exp = explode(";", $match);
foreach($exp as $val){
print(implode(" ", array_map("trim", array_reverse(explode(",", $val))))."<br/>");
}
}
The following should work:
$pattern = '~(?<=\[|\G;)([^,]+),([^;\]]+)~';
if (preg_match_all($pattern, $row['Correspounding_Author'], $matches, PREG_SET_ORDER)) {
print_r(array_map(function($match) {
return sprintf('%s %s', ltrim($match[2]), ltrim($match[1]));
}, $matches));
}
It's a single expression that matches items that:
Start with opening square bracket [ or continue where the last match ended followed by a semicolon,
End just before either a semicolon or closing square bracket.
See also: PCRE Assertions.
Output
Array
(
[0] => Ayse Serap Karadag
[1] => Serap Gunes Bilgili
[2] => Omer Calka
[3] => Sevda Onder
[4] => Evren Burakgazi-Dalkilic
)

Using RegEx to Capture All Links & In Between Text From A String

<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)
Given that this is all on one line, how can I match or better yet extract all full urls and text? ie. for this example I wish to extract:
http://www.someurl(.+) . maybe some text here(.*) . www.someotherurl(.+) . maybe even more text(.*)
Basically, <Link.*:.* would start each link capture and > would end it. Then all text after the first capture would be captured as well up until zero or more occurrences of the next link capture.
I have tried:
preg_match_all('/<Link.*?:.*?(https|http|www)(.+?)>(.*?)/', $v1, $m4);
but I need a way to capture the text after the closing >. The problem is that there may or may not be another link after the first one (of course there could also be no links to begin with!).
$string = "<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)";
$string = preg_split('~<link(?: to)?:\s*([^>]+)>~i',$string,-1,PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
echo "<pre>";
print_r($string);
output:
Array
(
[0] => http://www.someurl(.+)
[1] => maybe some text here(.*)
[2] => www.someotherurl(.+)
[3] => maybe even more text(.*)
)
You can use this pattern:
preg_match_all('~<link\b[^:]*:\s*\K(?<link>[^\s>]++)[^>]*>\s*(?<text>[^<]++)~i',
$txt, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
printf("<br/>link: %s\n<br/>text: %s", $match['link'], $match['text']);
}

How to get the content in a div using php

In my application I am trying to get the number of pages Google has indexed, and I found out that the number is available in the following div:
<div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>
Now my question is how to extract the number from the above div in a web page.
Never use regex to parse HTML. (See: RegEx match open tags except XHTML self-contained tags)
Use an HTML parser, like Simple HTML DOM (http://simplehtmldom.sourceforge.net/)
You can then use CSS selectors to pick the element:
$html = file_get_html('http://www.google.com/');
$divContent = $html->find('div#resultStats', 0)->plaintext;
$matches = array();
preg_match('/([0-9,]+)/', $divContent, $matches);
echo $matches[1];
Outputs: "1,960,000"
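If adding a third-party parser is not an option, the same idea can be sketched with PHP's built-in DOM extension. This is only an illustration under assumptions: the function name extractResultCount() is hypothetical, the HTML is passed in as a string rather than fetched, and the count is taken to be the first run of digits and commas in the div.

```php
<?php
// Hypothetical helper: first number inside <div id="resultStats">, or null if absent.
function extractResultCount(string $html): ?int
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world pages rarely parse without warnings
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $node = $xpath->query('//div[@id="resultStats"]')->item(0);
    if ($node === null) {
        return null;
    }

    // The regex only runs on the div's plain text, never on the markup.
    if (!preg_match('/\d[\d,]*/', $node->textContent, $m)) {
        return null;
    }
    return (int) str_replace(',', '', $m[0]);
}

$html = '<div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>';
var_dump(extractResultCount($html)); // int(1960000)
```

Returning an integer instead of the formatted string makes the value easier to compare or store; keep $m[0] if the "1,960,000" form is what you need.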
$str = '<div id="resultStats"> About 1,960,000 results (0.38 seconds) </div> ';
$matches = array();
preg_match('/<div id="resultStats"> About ([0-9,]+?) results[^<]+<\/div>/', $str, $matches);
print_r($matches);
Output:
Array (
[0] => <div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>
[1] => 1,960,000
)
This is a simple regex with subpatterns:
([0-9,]+?) - matches the digits 0-9 and the , character, at least once and non-greedily.
[^<]+ - matches any character except <, one or more times.
echo $matches[1]; - will print the number you want.
You can use a regex (preg_match) for that:
$your_div_string = '<div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>';
preg_match('/<div.*>(.*)<\/div>/i', $your_div_string, $result);
print_r( $result );
output will be
Array (
[0] => <div id="resultStats"> About 1,960,000 results (0.38 seconds) </div>
[1] => About 1,960,000 results (0.38 seconds)
)
in this way you can get content inside div

extract two parts of a string using regex in php

I have this string:
&lt;img src=images/imagename.gif alt='descriptive text here'&gt;
and I am trying to split it up into the following two strings (an array of two strings, whatever, just broken up):
imagename.gif
descriptive text here
Note that yes, it's actually the &lt; entity and not a literal <. The same goes for the end of the string.
I know regex is the answer, but I'm not good enough at regex to know how to pull it off in PHP.
Try this:
<?php
$s="<img src=images/imagename.gif alt='descriptive text here'>";
preg_match("/^[^\/]+\/([^ ]+)[^']+'([^']+)/", $s, $a);
print_r($a);
Output:
Array
(
[0] => <img src=images/imagename.gif alt='descriptive text here
[1] => imagename.gif
[2] => descriptive text here
)
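Better to use DOM and XPath rather than regex:
<?php
// Decode the entities first so the parser sees a real <img> tag.
$your_string = html_entity_decode("&lt;img src=images/imagename.gif alt='descriptive text here'&gt;");
$dom = new DOMDocument;
$dom->loadHTML($your_string);
$x = new DOMXPath($dom);
foreach ($x->query("//img") as $node)
{
// basename() drops the images/ directory, leaving imagename.gif
echo basename($node->getAttribute("src")), "\n";
echo $node->getAttribute("alt"), "\n";
}
?>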
