Merge Two Half-Page PDF Documents with PHP

A friend of mine works at a newspaper and asked me this on Monday, and I couldn't confirm whether it was possible or not.
I know it's possible to merge two PDFs using PHP (I've seen many other questions already answered), but what I'm not sure of is whether I can merge a half-page PDF to fill a space in another PDF.
Imagine the following:
I have PDF1, a half-page PDF, and PDF2, a three-page PDF.
On the first page of PDF2 there is an empty space to fit PDF1.
Can I do this? How?

I can't give you specific source code, but I can explain how to do it at the very low level. Also, what you're looking for is similar to what's called impositioning in the publishing industry.
You start out the same way as merging, which means pulling in pages from another document. You must bring in all dependencies of the page recursively. Watch out for infinite loops: reference cycles do exist in PDF, so you must keep track of visited objects. Don't use recursive functions, because PDF references can be very deep and your stack will easily overflow. Implement the traversal with an explicit stack on the heap instead (depth-first search is fine).
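As a rough illustration of that traversal, here is a minimal sketch in Python (chosen only for brevity; the same loop ports directly to PHP arrays). The `objects` dictionary is a hypothetical stand-in for a parsed PDF, mapping an object id to the ids it references. A real library would hand you indirect references instead.

```python
def collect_dependencies(objects, root_id):
    """Return the ids of every object reachable from root_id, root included.

    `objects` is a hypothetical stand-in for a parsed PDF: it maps an object
    id to the list of object ids that object references.
    """
    visited = set()
    stack = [root_id]          # explicit stack on the heap, no recursion
    order = []
    while stack:
        obj_id = stack.pop()
        if obj_id in visited:  # already handled: this is what breaks cycles
            continue
        visited.add(obj_id)
        order.append(obj_id)
        # Push references in reverse so they are visited in listed order.
        stack.extend(reversed(objects.get(obj_id, [])))
    return order


# A page (1) whose resources (2) hold an object (3) that points back at the page.
graph = {1: [2, 4], 2: [3], 3: [1], 4: []}
print(collect_dependencies(graph, 1))   # [1, 2, 3, 4]
```

The visited set is what stops the 3 -> 1 back-reference from looping forever, and because the stack is an ordinary list, depth is limited by memory rather than by the call stack.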
The key to stamping PDF on PDF is to turn the source Page object into an XObject form (not to be confused with AcroForms or fillable form fields). An XObject form is very similar to a Page object, with the following exceptions:
- The /Type /Page becomes /Type /XObject /Subtype /Form.
- The page MediaBox and CropBox together become /BBox in the form. Be careful: both can be inherited via the page tree, so you must look for inherited attributes.
- The page Rotate (also inheritable) becomes /Matrix, which is a transformation (rotation) matrix instead of an angle.
- The page's Resources, Group and Metadata can be brought in unchanged and added to the form object.
- The page Contents stream must be transferred to the form. However, the page Contents is an external object and may be an array, which means you need to merge the pieces. The XObject form is itself a stream object.
- All other attributes are tricky, and you might want to ignore them if you are unsure.
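Two of the points above, the inherited attributes and the Rotate-to-Matrix conversion, can be sketched as follows. This is only an illustration under assumptions: pages are modelled as plain dictionaries with a hypothetical "Parent" link, only MediaBox is used for the box, and Rotate is taken as a clockwise multiple of 90 degrees. The translation part of the matrix is my own choice to bring the rotated box back to the origin.

```python
def inherited(node, key, default=None):
    """Look up `key` on a page dict, walking up hypothetical 'Parent' links."""
    while node is not None:
        if key in node:
            return node[key]
        node = node.get("Parent")
    return default


def rotate_to_matrix(rotate, bbox):
    """Turn a page /Rotate angle (clockwise, multiple of 90) into a form /Matrix.

    `bbox` is (llx, lly, urx, ury). The translation part moves the rotated
    box back so that its lower-left corner sits at the origin.
    """
    llx, lly, urx, ury = bbox
    rotate %= 360
    if rotate == 0:
        return (1, 0, 0, 1, -llx, -lly)
    if rotate == 90:
        return (0, -1, 1, 0, -lly, urx)
    if rotate == 180:
        return (-1, 0, 0, -1, urx, ury)
    if rotate == 270:
        return (0, 1, -1, 0, ury, -llx)
    raise ValueError("/Rotate must be a multiple of 90")


def apply(matrix, x, y):
    """Apply a PDF matrix [a b c d e f] to the point (x, y)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# MediaBox and Rotate live on the parent node, not on the page itself.
tree = {"MediaBox": (0, 0, 100, 200), "Rotate": 90}
page = {"Type": "Page", "Parent": tree}
bbox = inherited(page, "MediaBox")
m = rotate_to_matrix(inherited(page, "Rotate", 0), bbox)
print(m)                # (0, -1, 1, 0, 0, 100)
print(apply(m, 0, 0))   # (0, 100)
```

With a 100 x 200 page rotated by 90, the lower-left corner ends up at the top-left of a 200 x 100 area, which is what a clockwise quarter turn should do.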
Once this is done, all you have to do is paint the XObject form on the new page. You have to generate a unique name for the XObject and add it to the page's Resources. Painting itself is a cm operator followed by a Do operator, just like painting an image. If you need to crop the original content, then you also need to set a clipping path before Do.
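To make the painting step concrete, here is a small sketch that builds the operator sequence for placing a form into an empty rectangle. The function name, the name `Fm1` and the example rectangles are illustrative assumptions. It also assumes the form carries no extra /Matrix of its own and that `Fm1` has already been registered under the page's /Resources /XObject dictionary.

```python
def paint_form(name, bbox, target, clip=True):
    """Build content-stream operators that draw form `name` inside `target`.

    `bbox` is the form's /BBox and `target` the empty rectangle on the page,
    both as (llx, lly, urx, ury) in points. The form is scaled uniformly to fit.
    """
    bw, bh = bbox[2] - bbox[0], bbox[3] - bbox[1]
    tw, th = target[2] - target[0], target[3] - target[1]
    s = min(tw / bw, th / bh)        # keep the aspect ratio
    e = target[0] - bbox[0] * s      # map the BBox corner onto the
    f = target[1] - bbox[1] * s      # target's lower-left corner
    ops = ["q"]                      # save the graphics state
    if clip:
        # Clip to the target rectangle, in page coordinates, before Do.
        ops.append(f"{target[0]:g} {target[1]:g} {tw:g} {th:g} re W n")
    ops.append(f"{s:g} 0 0 {s:g} {e:g} {f:g} cm")
    ops.append(f"/{name} Do")
    ops.append("Q")                  # restore the graphics state
    return "\n".join(ops)


print(paint_form("Fm1", (0, 0, 595, 421), (36, 36, 333.5, 246.5)))
```

For this example the output is `q`, `36 36 297.5 210.5 re W n`, `0.5 0 0 0.5 36 36 cm`, `/Fm1 Do`, `Q`, one operator group per line. Wrapping everything in q/Q keeps the scaling and the clipping path from leaking into whatever the page draws next. The `:g` formatting is fine for a sketch, but a real writer should format numbers so they never come out in exponent notation.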
Needless to say, this is far from trivial, and there are lots of pitfalls. I have implemented this and I can tell you it really works, but it's harder than it seems. You must have a very good low level PDF library, and a very thorough understanding of the PDF specs.
I haven't discussed some of the other details, such as color management (what if you paint DeviceRGB on managed CMYK), PDF/A, PDF/X, transferring annotation and form fields, etc.
If this is beyond you, you should be looking for an open-source impositioning library, because it does pretty much the same. Impositioning means placing two or more pages on a blank sheet of paper, with the purpose of printing a book or a flyer. I do have a commercial solution as well.

Related

Table data extraction from image or scanned documents (Not pdf)

I want to extract the table data from images or scanned documents and map the header fields to their particular values, mostly in insurance documents. I have tried extracting the text line by line and then mapping it using its position on the page. I gave the table boundary by defining a table start and end pivot, but it doesn't give me proper results, since headers sometimes span multiple lines (I had implemented this in PHP). I also want to know whether I can use machine learning to achieve the same.
For pdf documents I have used tabula-java which worked pretty well for me. Is there a similar type of implementation for images as well?
Insurance_Image
The documents would be of similar type as in the link above but of different service providers so a generic method of extracting such data would be very useful.
In the image above I want to map values like Make = YAMAHA, MODEL = FZ-S, CC = 153, etc.
Thanks.

I would definitely give Tesseract a go; it is a very good OCR engine. I have been using it successfully to read all sorts of documents embedded in emails (PDFs, images), and a colleague of mine used it for something very similar to your use case: reading specific fields from invoices.
After you parse the document, simply use regex to pick the fields of interest.
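A minimal sketch of that regex step, assuming the OCR output keeps each label next to its value and separates them with ":" or "=". The sample text below is made up from the values in the question, and real OCR output will be noisier. It is written in Python; PHP's preg_match would take essentially the same pattern.

```python
import re

# Hypothetical OCR output for the fields mentioned in the question.
ocr_text = """Make : YAMAHA   Model : FZ-S
CC : 153"""


def pick_fields(text, labels):
    """Return {label: value} for every label followed by ':' or '=' and a token."""
    fields = {}
    for label in labels:
        pattern = rf"\b{re.escape(label)}\s*[:=]\s*(\S+)"
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            fields[label] = match.group(1)
    return fields


print(pick_fields(ocr_text, ["Make", "Model", "CC"]))
# {'Make': 'YAMAHA', 'Model': 'FZ-S', 'CC': '153'}
```

Labels that are missing from the text are simply left out of the result, so the caller can tell which fields the OCR pass failed to find.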
I don't think machine learning would be particularly useful for you, unless you plan to build your own OCR engine. I'd start with existing libraries, they offer very good performance.

The easiest and most reliable way to do it without much knowledge in OCR would be this:
- Take an empty template for reference and mark the boxes coordinates that you need to extract the data from. Label them and save them for future use. This will be done only once for each template.
- Now, when reading a filled-in copy of the same template, resize it to match the reference template's dimensions (if it doesn't match already).
- You have already every box's coordinates and know what data it should contain (because you labeled them and saved them on the first step).
Which means that now you can just analyze the pixels contained in each box to know what is written there.
This means that given a list of labeled boxes (that you extracted in the first step), you should be able to get the data in each one of these boxes. If this data is typed and not hand written the extracted data would be easier to analyze or do whatever you want with it using simple OCR libraries.
Or, if the data is always in the same size and font as in your example template above, you could just build your own small database of letters in that font and size, or maybe of full words, depending on each box's possible answers.
Anyway this is not the best approach by far but it would definitely get the work done with minimal effort and knowledge in OCR.
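The template idea comes down to a fixed set of labelled rectangles plus one scale factor per axis. The sketch below applies the correspondence in the other direction: instead of resizing the scan to the reference template, it scales the saved boxes to the scan, which selects the same regions. The box coordinates and image sizes are invented for illustration.

```python
def scale_boxes(boxes, ref_size, scan_size):
    """Map labelled (x, y, w, h) boxes from reference pixels to scan pixels."""
    sx = scan_size[0] / ref_size[0]
    sy = scan_size[1] / ref_size[1]
    return {
        label: (round(x * sx), round(y * sy), round(w * sx), round(h * sy))
        for label, (x, y, w, h) in boxes.items()
    }


# Boxes marked once on a 1000 x 1400 px reference template (made-up values).
template_boxes = {"Make": (120, 300, 200, 40), "CC": (520, 300, 80, 40)}
print(scale_boxes(template_boxes, (1000, 1400), (2000, 2800)))
# {'Make': (240, 600, 400, 80), 'CC': (1040, 600, 160, 80)}
```

Each scaled box can then be cropped out of the scan and handed to whatever OCR step you use for that field.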

AVATAR Saving: "Compressing" multiple layered CSS div images into one file for download

I am building an avatar-generator for a PHP/MySQL site I am working on. It uses CSS to layer multiple .png files to create the background, body, facial expressions, etc. for a user's avatar. This I have covered.
I want to add a feature to my site that will allow the user to download their layered avatar "image" as one .jpg file. Is this even possible? I think I have seen this functionality before but can't recall the site where I saw this now.
Of course, I could come up with a series of pre-generated files that would cover all of the combinations possible with my images, but with somewhere around 200 objects to choose from and a maximum of 10 layers of choices, the number of permutations possible is somewhere around 8.14702044e+22! Obviously, this is possible for me to do but I would be old and gray before completing the task!
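For what it's worth, that figure matches the number of ordered selections of 10 distinct objects out of 200, i.e. 200 x 199 x ... x 191. Reading the number this way is my assumption about how it was obtained. A quick check in Python:

```python
import math

objects, layers = 200, 10
ordered = math.perm(objects, layers)   # 200 * 199 * ... * 191
print(f"{ordered:.8e}")                # 8.14702044e+22
```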
Poking around the Internet has led me to believe there might be some way to "screen cap" - with what software and if it can capture a small section of the screen I don't know. Besides, would this bog down my site (which is currently running at top speed)?
I've searched through Stack Overflow for similar questions but didn't find anything that addresses my problem specifically. That said, I am not certain what to even search for (the precise terminology) as this concept of layering and saving as one image is foreign to me.

I found a solution to my own problem. It's not quite what I had in mind when planning this part of the project, but the end result will be close enough (with some other modifications).
I will implement html2canvas (http://html2canvas.hertzen.com) to take a screenshot of the user's avatar when they have pressed "Save". I will store the resultant image to my server and this will be their avatar. Their selected variable data will be stored in a database so that they can load up their avatar at a later date and make changes to it.

How should I generate an isometric image of a Minecraft skin in PHP?

I'm trying to generate 3D isometric views of players' heads, but I'm not sure what kind of support PHP has for this type of operation, or of any external libraries that may be better suited.
Basically I need to take a net like this (here is a diagram showing what each portion is mapped to) and make a 3D head from it. I also need to include the 'head accessory' portions, which should be slightly larger/offset from the actual head.
Does anyone know how I should go about this?

Well, first of all, this will be a complex job in my view.
The http://www.minecraftwiki.net/images/0/01/Skinzones.png file you mentioned is flat, but you have to convert it to an isometric 3D look, so you have to distort the images.
For example, look at the images below.
You can see that the 3D box image is created from pieces of other images. The logic is to add perspective to the flat images and join them, but since the result is still 2D we will call it image distortion.
Unfortunately, the GD library that comes bundled with PHP is not advanced enough to let you do such things.
You have to use another library such as ImageMagick; this link is a tutorial on its distort functions: http://www.imagemagick.org/Usage/distorts/
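Whatever library does the distortion, it needs to know where the corners of each flat face should land. Below is a rough sketch of one common isometric projection (30 degree axes, y pointing up in 3D, screen y pointing down). The formula and the unit-cube model of the head are my assumptions, not something taken from the links above.

```python
import math


def iso_project(x, y, z, scale=1.0):
    """Project a 3D point (y up) to 2D screen coordinates (screen y down)."""
    c = math.cos(math.radians(30))
    s = math.sin(math.radians(30))
    return ((x - z) * c * scale, ((x + z) * s - y) * scale)


# Corners of the top face of a unit cube: the targets for the "top" tile.
top_face = [(0, 1, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1)]
for corner in top_face:
    sx, sy = iso_project(*corner, scale=8)
    print(corner, "->", (round(sx, 3), round(sy, 3)))
```

Pairing each source corner of a face tile with its projected position gives the control points an affine distortion needs. The slightly larger "accessory" layer could be handled by projecting a slightly larger cube around the same centre.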
The second big thing is the processing of the images. You can process them live, but that will consume lots of resources on the server, so I suggest you use pre-processed images rather than processing them every time.
To generate the isometric image you have to write the code yourself, and it may need alteration for each character image depending on the size of the image. But once you have written the code it will be easy.
My suggestion is to write your own code once, then alter it for every character, save the processed images in a sprite, and use them when you add play functionality.
Check out this link as well:
http://www.fmwconcepts.com/imagemagick/index.php

Multipage invoice/document in PHP

TL;DR: I have problems with PHP generating PDFs longer than 1 page total.
Hello again. My goal is to create a script that will basically get all the important data and create an A4-format PDF invoice/document for printing/mailing/archiving. Generating the PDF document is fine as long as the document does not overflow.
I want the invoice pages to be outlined with a border. The invoice should contain:
- the needed stuff required for the invoice to be valid
- billed products/other information
- a place for supplier/customer signature and stamp or other data
All the pages HAVE to contain a header (company logo) and a footer (page # of # - Invoice/Document ID - Date and Time - office ID - Printer ID, assigned personnel, whatever someone can ask), as well as a border around the document body (under the header, above the footer).
Everything is fine as long as the document size is not bigger than
$pageSize-$pageMargins-$header-$footer-$invoiceDataBlock-$signaturesBlock
which is basically just like 10 cm for the actual invoiced items. If the document is bigger, I actually create attachment for the invoice manually using spreadsheet editor.
The question is: What can I do to create a multipage PDF document that has no problems like invoiced items overlaying the header/footer? I need to know when to continue on the next page. How do I know this? What is the best way to accomplish this task?
Thank you in advance!

I've used both FPDF and TCPDF to generate multi-page invoice files. They are roughly the same in terms of how they work. (I started with FPDF, then switched to TCPDF when I needed to include Unicode characters, which FPDF didn't support at the time.)
As Eugen suggested, you can hand-roll your own headers and footers more easily than using the functions built in to either FPDF or TCPDF.
My strategy for making sure I don't overwrite footers is simply to be careful with the data included on the invoice. When adding new SKUs, I test long names to make sure they will fit in their field in the invoice PDF. For items that must be variable-length, I put unknown content onto its own line to reduce possible impact:
Domain registration (2 years)
↳ example.com
As I generate each page of the invoice, I keep track of how many lines I've used. I know I can safely put 20 lines of items on a page, and I know my largest single item is 2 lines, so when I get to 20 lines, I start a new page. 15 items means one page. 25 items means two pages. The item counter goes up, and every time I hit the 20-line limit, it generates the next page and resets the page item counter.
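Since the question is about knowing when to continue on the next page, here is a language-neutral sketch of that line-counting loop, written in Python (the same few lines translate directly to a PHP foreach). It is not the code I use in production. One small variation is assumed: the check happens before an item is added, so a two-line item is never split across pages.

```python
def paginate(items, max_lines=20):
    """Split (description, line_count) items into pages of at most max_lines."""
    pages, current, used = [], [], 0
    for description, lines in items:
        # Start a new page if this item would not fit on the current one.
        if used + lines > max_lines and current:
            pages.append(current)
            current, used = [], 0
        current.append(description)
        used += lines
    if current:
        pages.append(current)
    return pages


single = [(f"Item {i}", 1) for i in range(1, 26)]
print(len(paginate(single[:15])), len(paginate(single)))   # 1 2
```

Each inner list is one page's worth of items. The drawing code would emit its header, the items and its footer once per list.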
Note that I'm not including any code in this answer because you didn't include any code in your question. If you'd like help with implementation, I suspect that will be grounds for an additional question. :-)

Use TCPDF. It has a very handy SetY() / GetY() pair of functions that lets you know where on the page you are. You can use this to know when to do a page break.
Hint: Do not use the Header/Footer capabilities - they are clunky. Draw your own headers/footers.
Edit
Following the discussion below, here are some details. To avoid overlaying, you have two possibilities:
- Use getStringHeight() and calculate
- Use transactions
The first version draws its rationale from the fact that, of all the objects you typically use when generating a PDF, a text flow is the only one whose height you cannot tell beforehand. getStringHeight() provides you with a good enough estimate, so you know before adding the element whether it will fit on the page (leaving enough room at the bottom for the footer). So basically you extend your drawing loop to calculate the height of each element and test whether you need to start a new page first. This also allows for a sort of keep-together: e.g. if the space remaining after a section title is too small, start a new page before the title, to keep section title and section body together.
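The measuring loop can be sketched without any PDF library at all. In the Python sketch below the heights are plain numbers in mm that are assumed to come from a measuring call such as getStringHeight(). The function name, the element kinds and the page geometry are illustrative only.

```python
def layout(elements, page_height, top, bottom):
    """Assign (kind, height) elements to pages using pre-measured heights.

    `top` is the y where the body starts (below the header) and `bottom`
    is the space reserved for the footer.
    """
    limit = page_height - bottom           # lowest y the body may reach
    y, page, placed = top, 1, []
    for i, (kind, height) in enumerate(elements):
        needed = height
        # Keep a section title together with the element that follows it.
        if kind == "title" and i + 1 < len(elements):
            needed += elements[i + 1][1]
        if y + needed > limit and y > top:
            page, y = page + 1, top        # page break: reset the cursor
        placed.append((page, y, kind))
        y += height
    return placed


# A4 page, 40 mm header area, 30 mm footer area.
print(layout([("text", 200), ("title", 10), ("text", 30)], 297, 40, 30))
# [(1, 40, 'text'), (2, 40, 'title'), (2, 50, 'text')]
```

The title alone would still fit on page 1 (240 + 10 is under 267), but title plus body would not, so both move to page 2.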
The second version is even easier: in TCPDF you can use transactions similar to a database: start a transaction, draw, and if the result is not to your liking roll back, else commit. We found this to be quite a performance hog, ultimately deciding against it for long textual reports, but a 2-page invoice is a very different beast.
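The shape of the transaction approach looks roughly like the toy model below. `Doc` is a made-up stand-in rather than TCPDF's API, and the snapshot is a deep copy of the whole document state. Check the TCPDF documentation for the real transaction methods before relying on any of this.

```python
import copy


class Doc:
    """Toy stand-in for a PDF writer: a list of pages holding drawn blocks."""

    def __init__(self, body_height):
        self.body_height = body_height
        self.pages = [[]]
        self.y = 0

    def add_page(self):
        self.pages.append([])
        self.y = 0

    def draw(self, text, height):
        self.pages[-1].append(text)
        self.y += height


def draw_or_break(doc, text, height):
    """Draw a block; if it ran past the body, roll back and retry on a new page."""
    snapshot = copy.deepcopy(doc.__dict__)   # "start transaction"
    doc.draw(text, height)
    if doc.y > doc.body_height:
        doc.__dict__.update(snapshot)        # "roll back"
        doc.add_page()
        doc.draw(text, height)
    # Falling through without restoring plays the role of "commit".


doc = Doc(body_height=100)
for i in range(5):
    draw_or_break(doc, f"row {i}", 30)
print(doc.pages)
# [['row 0', 'row 1', 'row 2'], ['row 3', 'row 4']]
```

Copying the whole state for every element is cheap here but grows with the size of the document, which is consistent with the slowdown we saw on long reports.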

Programmatically combining images in PHP

I'm a big fan of Yahoo's recommendations for speeding up websites. One of the recommendations is to combine images where possible to cut down on size and the number of requests. However, I've noticed that while it can be easy to use CSS sprites for layouts, other image uses aren't as easily combined. The primary example I'm thinking of is a blog or article list, where each blog or article also has an image associated with it. Those images can greatly affect load time and page size, especially if they aren't optimized. What I'm looking for, in concept or in practice, is a way to dynamically combine those images while running them through a loss-less compression using PHP.
A few added thoughts or concerns:
- Combining the images and generating a dynamic CSS stylesheet to position the backgrounds of the images might be one way to go about it, but I also worry about accessibility and semantics. As far as I understand, CSS images should be used for layout elements and the img tag (with the alt attribute) should be used for images that are meant to convey information. I could set the image as a background to a div element and substitute a title attribute for the alt attribute, but I'm unsure about the accessibility and semantic implications of doing so.
- Might the GD library be a good candidate for something like this? Can you recommend other options?

I wouldn't go down this route if I were you. Sure, you may save a few bytes in protocol overhead by reducing the number of requests, but this would more than likely end up being self-defeating.
Imagine this scenario:
A blog site whose front page has 10 articles at a time. Each article has its own image associated with it. To save a byte or two of transfer time, you programmatically create a composite image of all 10 article images. You now have one of two problems.
1. You must update the composite image each time a new post is made, as the most recent 10 images will have a modified set of content.
2. You decide to create a new composite each request, on the fly.
Obviously, #1 is preferable here, and would not be difficult to implement. However, what if a user searches for all posts tagged with the word "SQL"? You are unlikely to have a composite image of the first 10 results already created for this simple query, let alone a more complex one. Also, what happens if you want to update or delete an image? Once again you'd have to trigger the background creation of the composite.
How about an RSS aggregator, like Google Reader? It wouldn't have the required logic to figure out which portion of a composite image it would need to display, and would probably display the full image. (I mention Google Reader because I very rarely visit blog sites directly, tending to trust to an RSS aggregation service like Reader)
If it were me, I'd leave the single images alone. With modern connection speeds, the tradeoff between additional bandwidth overhead and on-server processing time is unlikely to win you any great gains.
Having said that, if you decide to go down this route anyway, I'd say the GD library is an excellent place to start.
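If you do try it, the bookkeeping is mostly arithmetic: stack the images, remember each offset, and emit one CSS rule per image. The sketch below (Python, purely illustrative; the class names and sizes are invented) computes only the layout. The actual pixel copying would be done in PHP with GD's imagecreatetruecolor() and imagecopy().

```python
def sprite_layout(sizes):
    """Stack images vertically; return the sheet size and per-image CSS rules."""
    width = max(w for w, _ in sizes)
    y, rules = 0, []
    for index, (w, h) in enumerate(sizes):
        rules.append(
            f".article-{index} {{ width: {w}px; height: {h}px; "
            f"background-position: 0 -{y}px; }}"
        )
        y += h                      # the next image starts below this one
    return (width, y), rules


sheet, css = sprite_layout([(100, 80), (120, 60)])
print(sheet)        # (120, 140)
print(css[1])       # .article-1 { width: 120px; height: 60px; background-position: 0 -80px; }
```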

You'd almost certainly be better off reducing the file size of the images in articles than combining them. I'd agree that there might be accessibility issues with the method you suggest. Also, I suppose it depends on what you mean by "dynamic": if you're thinking of combining those images and generating CSS on each page load, you might well find that it results in slower page load times for users with average connection speeds.
As to your second point, GD could certainly handle that. A better use of GD for reducing page load times might be reducing the image quality of your article images to reduce filesizes, at article creation time, not at page load.
