Converting from div to table or parsing PDF in PHP

I'm developing a WebApp in which I take an invoice converted from PDF to HTML, then parse the invoice lines.
I have a div in my main window which displays the contents.
But when I display the contents from the invoice in that div, all the contents appear overlapped.
In the converted invoice there is no table, only divs with absolute positioning. I can't do it any other way, at least with this approach, because that's how the converter works.
So, as a solution I'm converting from "div to table", trying to decide when there is a change of row or not, based on the top parameter from the corresponding div.
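A minimal sketch of that row-detection idea, assuming the `top` and `left` values have already been read from each div's style attribute into plain arrays. The function names, the array keys and the 3-pixel tolerance are illustrative assumptions, not part of the converter's output:

```php
<?php
// Hypothetical input: one entry per absolutely positioned div, with the
// `top` and `left` values (in pixels) and the text content of the div.
function groupDivsIntoRows(array $divs, float $tolerance = 3.0): array
{
    // Sort top-to-bottom, then left-to-right.
    usort($divs, function (array $a, array $b): int {
        return [$a['top'], $a['left']] <=> [$b['top'], $b['left']];
    });

    $rows = [];
    $rowTop = null;
    foreach ($divs as $div) {
        // A jump in `top` larger than the tolerance starts a new row.
        if ($rowTop === null || abs($div['top'] - $rowTop) > $tolerance) {
            $rows[] = [];
            $rowTop = $div['top'];
        }
        $rows[count($rows) - 1][] = $div;
    }

    // Within each row, order the cells by `left`.
    foreach ($rows as &$row) {
        usort($row, fn(array $a, array $b): int => $a['left'] <=> $b['left']);
    }
    unset($row);

    return $rows;
}

// Renders the grouped rows as a plain HTML table.
function rowsToHtmlTable(array $rows): string
{
    $html = "<table>\n";
    foreach ($rows as $row) {
        $html .= '  <tr>';
        foreach ($row as $cell) {
            $html .= '<td>' . htmlspecialchars($cell['text']) . '</td>';
        }
        $html .= "</tr>\n";
    }
    return $html . "</table>\n";
}
```

The tolerance is there because converters often place cells of the same visual row a pixel or two apart; the right value would have to be tuned against real invoices.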
However, besides the invoice data, I also have the invoice header, and I'm having difficulty deciding whether a given div belongs to the same table or not.
But so far, I think the solution passes through making 3 tables, one for the company logo, one for the header, and one for the data.
But I need all these tables to appear in the correct positions and with the correct sizes.
At the moment, I'm not allowed to paste invoice examples, and as I'm stuck at an early stage (close to the algorithm stage), I don't think any examples of my code or of the invoices would help anyone understand the situation better.
But I promise to update this with examples soon.
As an alternative solution I could parse the PDF myself, but I haven't found a way to do it so far.
I'm using PHP to make the WebApp and verypdf pdf2html to make the conversion.
I know that with so little information it is hard to get help.
Any ideas are welcome.

How about trying to cure the overlapping itself? For example, you could strip all the styling information from the DIVs after the PDF is converted into DIVs. Then you can apply your own styles.
It might be useful to know if all the invoices are in the same format/arrangement, or not.
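As a rough sketch of the stripping step, assuming the converted HTML is available as a string: the function name is made up for illustration, and a regex is used only to keep the example short (a real HTML parser would be more robust against unusual markup).

```php
<?php
// Removes every inline style="..." attribute so that your own stylesheet
// controls the layout. Falls back to the input if the regex engine fails.
function stripInlineStyles(string $html): string
{
    $pattern = '/\sstyle\s*=\s*("[^"]*"|\'[^\']*\')/i';
    return preg_replace($pattern, '', $html) ?? $html;
}

// Hypothetical fragment of converter output.
$converted = '<div style="position:absolute;top:100px;left:10px">Widget</div>';
echo stripInlineStyles($converted), "\n";
```

Note that removing the positioning also removes the only layout information the converter produced, so the `top` and `left` values may be worth reading out before they are stripped.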

Related

Table data extraction from image or scanned documents (Not pdf)

I want to extract table data from images or scanned documents and map the header fields to their values, mostly in insurance documents. I have tried extracting the text line by line and then mapping the lines using their position on the page. I defined the table boundary with a start pivot and an end pivot, but that doesn't give proper results, since headers sometimes span multiple lines (I implemented this in PHP). I also want to know whether I can use machine learning to achieve the same.
For pdf documents I have used tabula-java which worked pretty well for me. Is there a similar type of implementation for images as well?
Insurance_Image
The documents would be of similar type as in the link above but of different service providers so a generic method of extracting such data would be very useful.
In the image above I want to map values like Make = YAMAHA, MODEL = FZ-S, CC = 153, etc.
Thanks.

I would definitely give Tesseract a go; it is a very good OCR engine. I have been using it successfully to read all sorts of documents embedded in emails (PDFs, images), and a colleague of mine used it for something very similar to your use case: reading specific fields from invoices.
After you parse the document, simply use regex to pick the fields of interest.
I don't think machine learning would be particularly useful for you, unless you plan to build your own OCR engine. I'd start with existing libraries; they offer very good performance.
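To illustrate the regex step mentioned above, here is a small sketch. The OCR text, the label names and the "label, then ':' or '=', then value" layout are all assumptions; real OCR output depends on the document and the engine.

```php
<?php
// Hypothetical OCR output, one "label : value" pair per line.
$ocrText = "Make : YAMAHA\nModel: FZ-S\nCC = 153\n";

// Returns the value found after each label, keyed by label.
function extractFields(string $text, array $labels): array
{
    $fields = [];
    foreach ($labels as $label) {
        // Label at the start of a line, ':' or '=', then the rest of the line.
        $pattern = '/^\s*' . preg_quote($label, '/') . '\s*[:=]\s*(.+?)\s*$/mi';
        if (preg_match($pattern, $text, $m)) {
            $fields[$label] = $m[1];
        }
    }
    return $fields;
}

print_r(extractFields($ocrText, ['Make', 'Model', 'CC']));
```

Labels that are missing from the text are simply left out of the result, which makes it easy to spot fields the OCR pass failed to read.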

The easiest and most reliable way to do it without much knowledge in OCR would be this:
- Take an empty template for reference and mark the boxes coordinates that you need to extract the data from. Label them and save them for future use. This will be done only once for each template.
- Now, when reading a filled-in copy of the same template, resize it to match the reference template's dimensions (if it doesn't match already).
- You have already every box's coordinates and know what data it should contain (because you labeled them and saved them on the first step).
Which means that now you can just analyze the pixels contained in each box to know what is written there.
This means that given a list of labeled boxes (that you extracted in the first step), you should be able to get the data in each one of these boxes. If this data is typed and not hand written the extracted data would be easier to analyze or do whatever you want with it using simple OCR libraries.
Or, if the data is always in the same size and font as in your example template above, you could just build your own small database of letters in that font and size, or maybe of full words, depending on each box's possible answers.
Anyway this is not the best approach by far but it would definitely get the work done with minimal effort and knowledge in OCR.
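The coordinate bookkeeping behind those steps can be sketched as follows. This variant scales the saved boxes to the scan's size instead of resizing the scan, which is an equivalent swap rather than exactly what is described above; the labels, pixel values and function name are hypothetical.

```php
<?php
// Labeled boxes marked once on the reference template
// (hypothetical values, in pixels).
$referenceSize = ['width' => 1000, 'height' => 1400];
$boxes = [
    'make'  => ['x' => 100, 'y' => 200, 'w' => 300, 'h' => 40],
    'model' => ['x' => 500, 'y' => 200, 'w' => 300, 'h' => 40],
];

// Scales every saved box from the reference size to the scan's size.
function scaleBoxes(array $boxes, array $referenceSize, array $scanSize): array
{
    $sx = $scanSize['width'] / $referenceSize['width'];
    $sy = $scanSize['height'] / $referenceSize['height'];

    $scaled = [];
    foreach ($boxes as $label => $b) {
        $scaled[$label] = [
            'x' => (int) round($b['x'] * $sx),
            'y' => (int) round($b['y'] * $sy),
            'w' => (int) round($b['w'] * $sx),
            'h' => (int) round($b['h'] * $sy),
        ];
    }
    return $scaled;
}

// A scan at twice the reference resolution.
print_r(scaleBoxes($boxes, $referenceSize, ['width' => 2000, 'height' => 2800]));
```

Each scaled box can then be cropped out of the scan and handed to whatever OCR step is in use.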

Best way to print HTML designed content?

I have a web app that lets users design their own invitation cards, which are then ordered, printed by us, and sent to the customer.
The problem we have is that it's difficult to print the cards exactly the way the user designed them. We are currently using wkhtmltopdf to export a PDF file from the user's design, which is then sent to print. This has caused us months of headaches. See this example:
As you can see, there are some important differences between the HTML result and the PDF. Most noticeable is the line break in "Välkommen på". Another common difference is that the line height changes, so that lines of text overlap each other in the PDF file, or end up more or less separated from each other than in the HTML.
My questions to you are:
Would you use this method, or is there any other, simpler method we could use to print the cards? For example, is there an easy way to just print the HTML itself from the browser (auto-fitting to the correct size of the content and so on), or do you have any other idea?
If you are a wkhtmltopdf wizard, do you know how we can solve issues like in the picture with the fonts?

I was able to solve the problem with the line breaks by setting the CSS property white-space: nowrap on all divs of my HTML content.

Best method to create a PDF from MySQL: TCPDF/FPDF or FDF?

Our company allows its clients to view reports via our website. The pages are php based and the data is collected from MySQL. These reports were written a long time ago and include inline css. The pages themselves look fine, but the print version is lacking. I want to take the reports and create visually appealing "printable" pages that contain our branding.
I have found three solutions so far.
#Media Print Stylesheets
This is the easiest method, but does not give me complete layout control. I want landscape mode and need to control where the page breaks occur so this method has been eliminated from my list of possible solutions. The reports are built by looping through PHP data, so while I can always put a page break after a or for example, I can't stop the page from breaking before it gets to the next set of data.
TCPDF/FPDF
From what I have seen, these classes will give me all of the control I need to customize a PDF. The challenge is that this appears to be a little more advanced than my programming skills allow, and all of the inline CSS contained within the HTML tables may throw off the formatting.
FDF
I am leaning towards this method if I understand it correctly. First I would create a PDF form and define all of the fields to be populated by the MySQL data. Then I would create a FDF file that would populate the form template with the data from the database. It seems easier to me to create a visually pleasing form via PDF and then populate that form using this method, rather than create the entire pdf from scratch using method 2.
Does it sound like I am on the right track? Are any of these methods "easier" than the other?
Any help is greatly appreciated.

TCPDF has the most control of each page which is what I am looking for. It is extremely sensitive when writing HTML, but that is the only downside I have found so far.

There's this excellent answer on SO already.
If you're looking for easy, my money is on mPDF. I found it to be the easiest, and essentially an out-of-the-box solution (often zero server configuration to do).

I think you should try out wkhtmltopdf.
https://code.google.com/p/wkhtmltopdf/
As for the TCPDF/FPDF pagination issue, you can see this other question for the solution provided and use the flow in it to sort yours out.
TCPDF / FPDF - Page break issue
Just found this other solution as well and think you'll need it
Convert HTML + CSS to PDF with PHP?

For me personally, FPDF works great to fetch data from my database, insert into the FPDF class and dynamically create PDF's for customers.
I see some people want to write HTML/CSS to create PDFs, but you will always have differences, as the browser parses the HTML/CSS differently than a PDF converter does.
When using FPDF's built-in methods, I have been able to get exactly what I wanted and haven't seen any issues (yet).

Multipage invoice/document in PHP

TL;DR: I have problems with PHP generating PDFs longer than one page.
Hello again. My goal is to create a script that will basically get all the important data and create an A4-format PDF invoice/document for printing/mailing/archiving. Generating the PDF document is fine as long as the document does not overflow.
I want the invoice pages to be outlined with a border. The invoice should contain:
- the needed stuff required for the invoice to be valid
- billed products/other information
- place for supplier/customer signature and stamp or other data
All the pages HAVE to contain a header (company logo) and a footer (page # of # - Invoice/Document ID - Date and Time - office ID - Printer ID, assigned personnel, whatever someone can ask), as well as a border around the document body (under the header, above the footer).
Everything is fine as long as the document size is not bigger than
$pageSize-$pageMargins-$header-$footer-$invoiceDataBlock-$signaturesBlock
which is basically just like 10 cm for the actual invoiced items. If the document is bigger, I actually create attachment for the invoice manually using spreadsheet editor.
The question is: What can I do to create a multipage PDF document that has no problems like invoiced items overlaying the header/footer? I need to know when to continue on the next page. How do I know this? What is the best way to accomplish this task?
Thank you in advance!

I've used both FPDF and TCPDF to generate multi-page invoice files. They are roughly the same in terms of how they work. (I started with FPDF, then switched to TCPDF when I needed to include Unicode characters, which FPDF didn't support at the time.)
As Eugen suggested, you can hand-roll your own headers and footers more easily than using the functions built in to either FPDF or TCPDF.
My strategy for making sure I don't overwrite footers is simply to be careful with the data included on the invoice. When adding new SKUs, I test long names to make sure they will fit in their field in the invoice PDF. For items that must be variable-length, I put unknown content onto its own line to reduce possible impact:
Domain registration (2 years)
↳ example.com
As I generate each page of the invoice, I keep track of how many lines I've used. I know I can safely put 20 lines of items on a page, and I know my maximum single item is 2 lines, so when I get to 20 lines, I start a new page. 15 items means one page. 25 items means two pages. The item counter goes up, and every time I hit the 20-line limit, it generates the next page and resets the page item counter.
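A library-independent sketch of that line counter, assuming each item already knows how many lines it occupies. The function name and the `lines` key are illustrative, and this variant starts a new page when the next item would not fit rather than exactly at the limit:

```php
<?php
// Splits invoice items into pages so that no page exceeds the line budget.
// Each item is an array with a 'lines' key (1 or 2 in the scheme above).
function paginateItems(array $items, int $maxLinesPerPage = 20): array
{
    $pages = [[]];
    $linesUsed = 0;
    foreach ($items as $item) {
        // Start a new page if this item would overflow the current one.
        if ($linesUsed > 0 && $linesUsed + $item['lines'] > $maxLinesPerPage) {
            $pages[] = [];
            $linesUsed = 0;
        }
        $pages[count($pages) - 1][] = $item;
        $linesUsed += $item['lines'];
    }
    return $pages;
}

// 25 single-line items end up on two pages (20 + 5).
$items = array_fill(0, 25, ['name' => 'Example item', 'lines' => 1]);
echo count(paginateItems($items)), "\n";
```

Each returned page can then be drawn with its own header and footer using whichever PDF library is in use.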
Note that I'm not including any code in this answer because you didn't include any code in your question. If you'd like help with implementation, I suspect that will be grounds for an additional question. :-)

Use TCPDF. It has a very handy SetY() / GetY() pair of functions that let you know where on the page you are. You can use this to know when to do a page break.
Hint: Do not use the Header/Footer capabilities - they are clunky. Draw your own headers/footers.
Edit
Following the discussion below, here are some details. To avoid overlaying, you have two possibilities:
- Use getStringHeight() and calculate
- Use transactions
The first version draws its rationale from the fact that, of all the objects you typically use when generating a PDF, a text flow is the only one whose height you cannot tell beforehand. getStringHeight() gives you a good enough estimate, so you know before adding the element whether it will fit on the page (leaving enough room at the bottom for the footer). So basically you extend your drawing loop to calculate the height of each element and test whether you need to start a new page first. This also allows for a sort of keep-together: if, for example, the space remaining after a section title is too small, start a new page before the title, so that section title and section body stay together.
The second version is even easier: in TCPDF you can use transactions, similar to a database. Start a transaction and draw; if the result is not to your liking, roll back, otherwise commit. We found this to be quite a performance hog and ultimately decided against it for long textual reports, but a 2-page invoice is a very different beast.
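The calculation in the first version can be sketched without any PDF library, assuming the height of every element has already been measured (for text, for instance with getStringHeight()). The function name, the `height` and `keepWithNext` keys and all the millimetre values are assumptions made for illustration:

```php
<?php
// Decides which page each element goes on, given pre-measured heights (mm)
// and the usable body height per page. Returns one page number per element.
function assignPages(array $elements, float $bodyHeight): array
{
    $elements = array_values($elements);
    $pageOf = [];
    $page = 1;
    $used = 0.0;
    foreach ($elements as $i => $el) {
        $needed = $el['height'];
        // Keep-together: a title must fit along with the element after it.
        if (!empty($el['keepWithNext']) && isset($elements[$i + 1])) {
            $needed += $elements[$i + 1]['height'];
        }
        // Break before the element if it would run into the footer area.
        if ($used > 0 && $used + $needed > $bodyHeight) {
            $page++;
            $used = 0.0;
        }
        $pageOf[$i] = $page;
        $used += $el['height'];
    }
    return $pageOf;
}

// A4 height minus top/bottom margins, header and footer (hypothetical sizes).
$bodyHeight = 297 - 2 * 10 - 30 - 20; // 227 mm

$elements = [
    ['height' => 200],                        // item table
    ['height' => 20, 'keepWithNext' => true], // section title
    ['height' => 30],                         // section body
];
print_r(assignPages($elements, $bodyHeight));
```

In this example the title alone would still fit on the first page, but title plus body would not, so both move to the second page.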

Pagination of text from xml file onto html page

O.K. so I'm developing a website to feature my fiction writings. I'm putting all of my documents into XML files, pulling and parsing them from the server with PHP and displaying them on the page. You can visit the page here for an example.
As the background image implies, what I would like to do is take the text and split it into two columns (with the text from the first spilling into the second), then paginate the overflow so that no scrolling is necessary. In other words, I'd like the text to read like a book, with the paging based on how long the body of the XML document is.
I would like for this to be done on the server side using PHP or something similar. Is there a way I can do this with an xsl stylesheet or a server-side script? I've been looking everywhere and can't seem to find anything.
Any help is appreciated.
Mr. Mutant

This is a surprisingly hard problem in general, and it's one you'll have no end of trouble with if you try to do it on the server. The problem with paginating HTML text is that where the page breaks go are entirely contingent on the client. The server doesn't know the client's screen resolution, font selection, or window size, and apart from the text itself those are the dependent variables for the problem.
I'd be surprised if at this point there weren't some jQuery library that just does this, but when I had to implement it myself about 7 years ago, here's the approach I took:
Create a div for each column. Each one contains the entirety of the document text. Style the divs with a fixed line height. Put the column divs at the bottom of the document's z-order. Now you can lay out the rest of the page, leaving holes of known size in the layout that the divs can show through, and by manipulating the vertical position of each div you can control which line is the first to appear inside a given hole.
You can then let the client manipulate the font size, and as long as you recalculate the height of the holes and then reposition the divs properly, it will all magically work.
There may be ways of doing this in HTML5 that are easier; I would definitely look into that.
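The arithmetic behind the div-offset approach described above can be sketched as follows. On a real page this would run on the client, since only the client knows the actual sizes; it is written in PHP here purely to show the calculation, and the function name and pixel values are assumptions.

```php
<?php
// For a given page (0-based), returns the vertical offset in pixels to apply
// to each column's div so that the right lines show through its hole.
function columnOffsets(int $page, float $holeHeight, float $lineHeight, int $columns = 2): array
{
    // Only whole lines may show, so round the hole down to a line multiple.
    $linesPerHole = (int) floor($holeHeight / $lineHeight);

    $offsets = [];
    for ($c = 0; $c < $columns; $c++) {
        // Index of the first line visible in this column on this page.
        $firstLine = ($page * $columns + $c) * $linesPerHole;
        // Shift the full-text div up by that many lines.
        $offsets[] = -$firstLine * $lineHeight;
    }
    return $offsets;
}

// Second page of a two-column layout with 400px holes and 20px lines.
print_r(columnOffsets(1, 400, 20));
```

When the font size changes, the same function can be called again with the new line height to reposition the divs.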
