I am writing a system at the moment where a user can select one item from each of 8 arrays (creating 8 options). They then have the option to print out their selections as a PDF.
The PDF is generated using DOMPDF and the data is sent via POST. I obviously need to test that each combination of options prints correctly (to my mind this means a hell of a lot of manual testing).
Is there anything I can do with a bash script to automate the testing process? As the content of the arrays will never change, would it be possible to write a testing script of some kind that I can fire through the browser or the terminal?
If the PDF files for a given combination of options are always identical, I would do the following:
I would create a master PDF for each option combination (you only need to create these once). Then I would generate the PDF files on the local filesystem, convert them to images and subtract them from the images of the corresponding master PDF.
If the resulting difference image is blank (every pixel is pure black), the two files are identical, so the generated PDF looks the same as the master; if not, something differs from the master and the test didn't pass.
In combination with PHPUnit you should be able to automate the testing rather easily.
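A minimal sketch of that idea as a PHPUnit test, assuming ImageMagick (with Ghostscript for rasterising PDFs) is installed, and using its compare tool instead of manual subtraction. The paths, the master naming scheme and the single-page assumption are all placeholders:

```php
<?php
use PHPUnit\Framework\TestCase;

class PdfRenderTest extends TestCase
{
    public function testCombinationMatchesMaster(): void
    {
        $master    = 'masters/combo_1.pdf';  // pre-approved master for this combination
        $candidate = '/tmp/candidate.pdf';   // PDF your DOMPDF endpoint just produced

        // Rasterise both PDFs at the same resolution (assumes single-page PDFs).
        exec('convert -density 150 ' . escapeshellarg($master) . ' /tmp/master.png');
        exec('convert -density 150 ' . escapeshellarg($candidate) . ' /tmp/candidate.png');

        // 'compare -metric AE' prints the number of differing pixels on stderr.
        exec('compare -metric AE /tmp/master.png /tmp/candidate.png /tmp/diff.png 2>&1', $out);

        $this->assertSame(0, (int) trim(implode('', $out)),
            'Generated PDF differs from the master');
    }
}
```

You would generate the candidate PDF for each combination (e.g. a simple POST via curl against the DOMPDF endpoint) and run one such assertion per combination.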
Situation
I have a website written in PHP.
In PHP, I can extract the text inside a pdf file uploaded to the same website and so on.
I found the tabula-java github repo.
So what's the issue?
I have tried the Mac app for Tabula. I noticed that I needed to highlight a certain section of the PDF before the table data could be converted.
However, that's not what I want to accomplish. I want to run Tabula in the background, on demand. When my website receives a file upload and certain conditions are satisfied, I want to call Tabula as a service somehow, feed it the unstructured data, and get back the tabulated data.
How do I go about doing this?
One way is to wrap the tabula-extractor command line tool and read its results back into your application.
For example, in R, the tabulizer package works this way.
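A sketch of the same wrapping approach in PHP, shelling out to tabula-java. The jar path is an assumption (use whichever release jar you downloaded); --pages, --format and --guess are documented tabula-java options, with --guess asking it to auto-detect the table area so nothing has to be highlighted:

```php
<?php
// Extract all tables from a PDF as CSV by wrapping the tabula-java CLI.
function extractTables(string $pdfPath): string
{
    $cmd = sprintf(
        'java -jar /opt/tabula/tabula.jar --pages all --format CSV --guess %s 2>&1',
        escapeshellarg($pdfPath)
    );
    exec($cmd, $output, $status);
    if ($status !== 0) {
        throw new RuntimeException("tabula failed:\n" . implode("\n", $output));
    }
    return implode("\n", $output); // CSV rows of the detected tables
}

// e.g. after validating the upload:
// $csv = extractTables($_FILES['doc']['tmp_name']);
```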
I am doing a bulk generation of pdf files based on templates and I ran into big performance issues pretty fast.
My current scenario is as follows:
get the data to be filled from the db
create an fdf based on a single data row and the pdf form
write the .fdf file to disk (a minimal example is sketched after this list)
merge the pdf with the fdf using pdftk (fill_form with the flatten option)
continue iterating over rows until all the pdfs are generated
all the generated files are merged together in the end and the single pdf is given to the client
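For illustration, a minimal single-record FDF writer covering steps 2 and 3; the field names here are hypothetical and must match the field names defined in the PDF form:

```php
<?php
// Build a single-record FDF file that pdftk's fill_form can consume.
function writeFdf(array $row, string $path): void
{
    $fields = '';
    foreach ($row as $field => $value) {
        $fields .= sprintf("<< /T (%s) /V (%s) >>\n",
            addcslashes($field, '()\\'),
            addcslashes($value, '()\\'));
    }
    $fdf = "%FDF-1.2\n"
         . "1 0 obj\n<< /FDF << /Fields [\n{$fields}] >> >>\nendobj\n"
         . "trailer\n<< /Root 1 0 R >>\n%%EOF\n";
    file_put_contents($path, $fdf);
}

writeFdf(['name' => 'Jane Doe', 'email' => 'jane@example.org'], '/tmp/row1.fdf');
// then: pdftk template.pdf fill_form /tmp/row1.fdf output out1.pdf flatten
```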
I use passthru to give the raw output to the client (this saves the time of writing a file), but that is only a small performance improvement. The total operation time is about 50 seconds for 200 records and I would like to get it down to at most 10 seconds in some way.
The ideal scenario would be to handle all these pdfs in memory and not write every single one to a separate file, but then producing the output would be impossible, as I can't pass that kind of data to an external tool like pdftk.
One other idea was to generate one big .fdf file with all those rows, but it looks like that is not allowed.
Am I missing something very trivial here?
I'm thankful for any advice.
PS. I know I could use a good library like pdflib, but I am only considering openly licensed libraries right now.
EDIT:
I am still trying to figure out the syntax to build an .fdf file with multiple pages that uses the same pdf as a template; I've spent a few hours on it and couldn't find any good documentation.
After being faced with the same problem for a long time (I wanted to generate my PDFs from LaTeX), I finally decided to switch to another crude but effective technique:
I generate my PDFs in two steps: first I generate HTML with a template engine like Twig or Smarty; second I use mPDF to generate PDFs from it. I tried many other html2pdf frameworks and ended up using mPDF: it's very mature and has been in development for a long time (frequent updates, rich functionality). The benefit of this technique is that you can use CSS to design your documents (mPDF has thorough CSS support), which brings all the usual advantages of CSS (http://www.csszengarden.com) and makes generating dynamic tables very easy.
mPDF parses HTML tables and looks for the thead and tfoot elements, repeating them on each page if a table is bigger than one page. You can also define page header and footer elements with dynamic entities like the page number and so on.
I know this detour looks like a workaround, but to be honest, no LaTeX or other PDF engine is as strong and simple as HTML!
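A sketch of the two-step pipeline described above, assuming the twig/twig and mpdf/mpdf Composer packages; the template name and data are hypothetical:

```php
<?php
require 'vendor/autoload.php';

$rows = [['item' => 'Widget', 'qty' => 2]]; // example data from your db

// Step 1: render HTML with a template engine.
$twig = new \Twig\Environment(new \Twig\Loader\FilesystemLoader('templates'));
$html = $twig->render('invoice.html.twig', ['rows' => $rows]);

// Step 2: let mPDF turn the HTML (and its CSS) into a PDF.
$mpdf = new \Mpdf\Mpdf();
$mpdf->WriteHTML($html);
$mpdf->Output('invoice.pdf', \Mpdf\Output\Destination::DOWNLOAD);
```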
Try a different, less complex library like FPDF (http://www.fpdf.org/).
I find it quite good and lightweight.
Always find libraries that are small and only do what you need them to do.
The bigger the library the more resources it consumes.
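For a feel of how lightweight it is, here is the classic minimal example from the fpdf.org tutorial:

```php
<?php
require 'fpdf.php';

$pdf = new FPDF();
$pdf->AddPage();
$pdf->SetFont('Arial', 'B', 16);
$pdf->Cell(40, 10, 'Hello World!');
$pdf->Output();
```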
This won't help your multiple-page problem, but I notice that pdftk accepts the - character to mean 'read from standard input'.
You may be able to send the .fdf to the pdftk process via its stdin, in order to avoid having to write the files to disk.
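A sketch of that with proc_open, reading the FDF from memory and writing the filled PDF to stdout as well (pdftk accepts - for both input and output):

```php
<?php
$fdfString = file_get_contents('/tmp/row1.fdf'); // or build the FDF in memory

$spec = [0 => ['pipe', 'r'], 1 => ['pipe', 'w'], 2 => ['pipe', 'w']];
$proc = proc_open('pdftk template.pdf fill_form - output - flatten', $spec, $pipes);

fwrite($pipes[0], $fdfString);            // feed the FDF via stdin
fclose($pipes[0]);

$pdf = stream_get_contents($pipes[1]);    // filled, flattened PDF
fclose($pipes[1]);
fclose($pipes[2]);
proc_close($proc);

// $pdf can now be concatenated/streamed without touching the disk
```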
We currently use PDForm to grab a blank pdf file (no values, just form fields and text) and list the form fields. We then query our database for the values which match those field names and create a pdf file with the newly populated data which the user can download from our site. The thing is PDForm is about $5,000 per machine and we are migrating servers. We want an alternative which is actively supported and recommended by the community.
I know Zend is working on a PDF manipulation extension, but we need something quick. I have done testing with PDFtk but the last update for that project was in 2006 and it now seems dead. It would be fine as it is open source, however it seems to be causing errors with certain files that seem to be generated with PDFPenPro (our pdf form creator).
Another solution I thought of: why not just use iText and write a Java wrapper that accepts command line input, so that PHP can call it with passthru() or exec()? There are other applications that would work if we completely rewrote our code, but we do not want to do that.
What we need:
The ability for PHP to receive the PDF form field names.
PHP to then either create an FDF file (and merge it with the PDF) or send a string to a command line application which will populate the fields with values from our database.
The user can then download the newly created PDF file with the populated form fields.
Am I moving in the right direction by creating a Java command line application that uses iText to parse and create the PDF files specified by PHP, or does anyone know of any cost-effective alternatives?
TCPDF seems to have the most robust feature set that I have seen so far.
Thanks, d2burke, for the tip on TCPDF. I'm not trying to do quite as much as the OP, but the software packages available to accomplish any kind of PDF generation are in the $2k to $3k range. TCPDF is PHP-based, open source, and the guy developing it is very supportive.
Always donate to these guys! Where in the world would web development be without them?
Since none of the above solutions would work for us (TCPDF doesn't work with forms the way we want, and PDFlib converts the form fields to blocks), we decided to create a command line wrapper for iText which grabs the form field names from the PDF and then populates them based on the database values.
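A sketch of the PHP side of that approach; the jar name, its subcommands and the JSON exchange format are hypothetical, since they depend entirely on how the iText wrapper is written:

```php
<?php
// 1) Ask the (hypothetical) wrapper for the form field names.
$fields = json_decode(
    shell_exec('java -jar itext-wrapper.jar list-fields ' . escapeshellarg('form.pdf')),
    true
); // e.g. ["name", "email", ...]

// 2) Look the values up in the database (stubbed here), then fill the form.
$values = ['name' => 'Jane Doe', 'email' => 'jane@example.org'];
passthru('java -jar itext-wrapper.jar fill ' . escapeshellarg('form.pdf')
    . ' ' . escapeshellarg(json_encode($values)));
```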
I don't know whether another product whose license costs range from $1k to $3k counts as "cost effective", but PDFlib works quite nicely. And if you don't need the PPS functionality, it gets cheaper.
I have to make an image of a dynamic page, i.e. a page that keeps changing every 5 minutes.
I want to make images of that page as it changes, so that I can keep its history saved in the form of images.
How can I do that using PHP?
I have no idea how to approach this, and a little elaboration in the answers would be highly appreciated!
Two steps:
1: Create a script that captures the current data in image form.
If you provide more information about what you mean by "create an image of dynamic data", I can probably point you to some resources you can use. For now, just have a look at the GD library (a minimal sketch follows the links below).
2: Set up a job that runs the script every 5 minutes
This can be done via cron. I would also suggest investigating whether you can run the script when the data changes, instead of at fixed intervals.
http://www.devarticles.com/c/a/PHP/Generating-Images-on-the-Fly-With-PHP/
http://www.thesitewizard.com/php/create-image.shtml
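A minimal GD sketch of step 1, assuming the "dynamic data" can be rendered as text; the snapshots/ directory and the data source are placeholders:

```php
<?php
$data = 'value at ' . date('H:i');   // stand-in for the dynamic data

$img = imagecreatetruecolor(400, 100);
$bg  = imagecolorallocate($img, 255, 255, 255);
$fg  = imagecolorallocate($img, 0, 0, 0);
imagefilledrectangle($img, 0, 0, 399, 99, $bg);
imagestring($img, 5, 10, 40, $data, $fg);   // font 5 is a built-in GD font

imagepng($img, 'snapshots/' . date('Ymd_Hi') . '.png'); // directory must exist
imagedestroy($img);
```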
Getting a screenshot of a web page isn't an easy task.
You can choose one of the online services that do this for you and download the images from there.
Otherwise, I have found a solution using webkit and python, but you will need full access to your Linux server in order to install the necessary packages; then you will be able to call that script from PHP and get your screenshots.
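One commonly used webkit-based CLI for this is wkhtmltoimage; assuming it is installed on the server, PHP can take the screenshot itself without a separate python script:

```php
<?php
$url = 'http://example.com/dynamic-page';      // hypothetical target
$out = 'shots/' . date('Ymd_Hi') . '.png';     // directory must exist

exec('wkhtmltoimage ' . escapeshellarg($url) . ' ' . escapeshellarg($out), $o, $status);
if ($status !== 0) {
    error_log("screenshot failed:\n" . implode("\n", $o));
}
```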
You have a forum (vBulletin) that has a bunch of images. How easy would it be to have a page that visits a thread, steps through each of its pages and forwards the images to the user (via ajax or whatever)? I'm not asking about filtering (that's easy, of course).
doable in a day? :)
I have a site that uses CodeIgniter as well; would it be even simpler using that?
Assuming this is to be carried out on the server, curl + regexps are your friends... and yes, doable in a day.
There are also some open-source HTML parsers that might make this cleaner.
It depends on where your scraping script runs.
If it runs on the same server as the forum software, you might want to access the database directly and check for image links there. I'm not familiar with vbulletin, but probably it offers a plugin api that allows for high level database access. That would simplify querying all posts in a thread.
If, however, your script runs on a different machine (or, in other words, is unrelated to the forum software), it would have to act as a http client. It could fetch all pages of a thread (either automatically by searching for a NEXT link in a page or manually by having all pages specified as parameters) and search the html source code for image tags (<img .../>).
Then a regular expression could be used to extract the image urls. Finally, the script could use these image urls to construct another page displaying all these images, or it could download them and create a package.
In the second case the script actually acts as a "spider", so it should respect things like robots.txt or meta tags.
When doing this, make sure to rate-limit your fetching. You don't want to overload the forum server by requesting many pages per second. Simplest way to do this is probably just to sleep for X seconds between each fetch.
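A sketch of that http-client approach; the thread URL pattern and page count are hypothetical:

```php
<?php
$lastPage = 5;   // e.g. discovered from the pagination links
$images   = [];

for ($page = 1; $page <= $lastPage; $page++) {
    // Fetch one page of the thread and pull out the img src attributes.
    $html = file_get_contents("http://forum.example.com/showthread.php?t=123&page=$page");
    if (preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $m)) {
        $images = array_merge($images, $m[1]);
    }
    sleep(2);    // rate-limit: be kind to the forum server
}

$images = array_unique($images);
```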
Yes doable in a day
Since you already have a working CI setup I would use it.
I would use the following approach:
1) Make a model in CI capable of:
logging in to vbulletin (images are often added as attachments and you need to be logged in before you can download them). Use something like snoopy.
collecting the url of the "last page" button using preg_match(), parsing the url with parse_url() and parse_str(), and generating links from page 1 to the last page
collecting html from all generated links. Still using snoopy.
finding all images in html using preg_match_all()
downloading all images. Still using snoopy.
moving the downloaded image from a tmp directory into another directory, renaming it imagename_01, imagename_02, etc. if the same imagename already exists (see the sketch at the end of this answer).
saving the image name and exact byte size in a db table. Then you can avoid downloading the same image more than once.
2) Make a method in a controller that collects all images
3) Set up a cronjob that collects images at regular intervals. wget -O /tmp/useless.html http://localhost/imageminer/collect should do nicely
4) Write the code that outputs pretty HTML for the end user, using the db table to get the images.
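For the renaming/dedupe part of step 1, a sketch; db_image_exists() stands in for a hypothetical CI model method backed by the name + byte size db table:

```php
<?php
// Hypothetical stand-in for the model's db lookup (name + exact byte size).
function db_image_exists(string $name, int $size): bool { return false; }

// Suffix _01, _02, ... until the name is free in the target directory.
function uniqueName(string $dir, string $name): string
{
    $info = pathinfo($name);
    $candidate = $name;
    for ($i = 1; file_exists("$dir/$candidate"); $i++) {
        $candidate = sprintf('%s_%02d.%s', $info['filename'], $i, $info['extension']);
    }
    return $candidate;
}

$tmpFile   = '/tmp/img_dl_123';      // where the download landed (hypothetical)
$imageName = 'photo.jpg';            // original name taken from the url
$imageDir  = '/var/www/images';

if (!db_image_exists($imageName, filesize($tmpFile))) {
    rename($tmpFile, "$imageDir/" . uniqueName($imageDir, $imageName));
}
```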