I need to create weekly texts using the same template. Being the lazy programmer I am, I wanted to automate most of it by creating a Google Form where I can input the data. A PHP script should then parse the new entry and put it into an automatically created new document.
I have created the template with placeholders such as <DATE> or <NEWMEMBERCOUNT> that I later want to replace with the values entered via the Google Form.
For this I have already utilized the packages google/apiclient and asimlqt/php-google-spreadsheet-client to read the form results (which are stored in a spreadsheet) and duplicate the template doc for each entry.
I'm almost finished and just need to replace the placeholders with their corresponding values, but I can't seem to find a way to do that. Specifically, I need to read the content of the document, perform some transformations on it (i.e. replace the placeholders) and save it with the transformed text.
I should have thought about this before I started programming it...
Is it possible for me to edit documents at all, using just PHP? If so, how could I go about it? Any guidance is appreciated!
You can't edit in situ, but you can download, edit, upload. Is this a classic mail merge, i.e. take a spreadsheet containing (rows of) data and apply a template to those rows, resulting in an output file for each row?
If so, simples...
- Download the spreadsheet
- Download the template
- For each spreadsheet row:
  - replace the placeholders with data
  - insert a new file into Drive

That can all be done with the Drive API from PHP.
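A minimal sketch of that loop in PHP, assuming the v3 Drive service from google/apiclient and a plain-text export (which drops the template's formatting); the file ID, credentials file, and $rows data are illustrative:

```php
<?php
require 'vendor/autoload.php';

$client = new Google_Client();
$client->setAuthConfig('credentials.json');        // service account key (assumption)
$client->addScope(Google_Service_Drive::DRIVE);
$drive = new Google_Service_Drive($client);

$templateId = 'YOUR_TEMPLATE_FILE_ID';

// Download: export the template Google Doc as plain text
$response = $drive->files->export($templateId, 'text/plain', ['alt' => 'media']);
$template = (string) $response->getBody();

// Normally read from the form's results spreadsheet
$rows = [['date' => '2016-05-30', 'newMemberCount' => 12]];

foreach ($rows as $row) {
    // Edit: replace the placeholders with this row's data
    $text = str_replace(
        ['<DATE>', '<NEWMEMBERCOUNT>'],
        [$row['date'], $row['newMemberCount']],
        $template
    );

    // Upload: insert the result as a new Google Doc
    $file = new Google_Service_Drive_DriveFile([
        'name'     => 'Weekly text ' . $row['date'],
        'mimeType' => 'application/vnd.google-apps.document',
    ]);
    $drive->files->create($file, [
        'data'       => $text,
        'mimeType'   => 'text/plain',
        'uploadType' => 'multipart',
    ]);
}
```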
This is not possible with anything except Google Apps Script; see https://developers.google.com/apps-script/reference/document/document-app
You can use Apps Script to create a "content service" and call it from your PHP. Beware of the limited quotas if you plan to make many daily calls.
More info about building such a content service is covered in other Stack Overflow questions that ask about that specifically.
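A hedged sketch of the PHP side of that call, assuming you have already published an Apps Script web app (doGet) that fills the template server-side; the deployment URL and parameter names are made up for the example:

```php
<?php
$endpoint = 'https://script.google.com/macros/s/YOUR_DEPLOYMENT_ID/exec';

$ch = curl_init($endpoint . '?' . http_build_query([
    'templateId'     => 'YOUR_TEMPLATE_FILE_ID',   // hypothetical parameters your
    'date'           => '2016-05-30',              // Apps Script would read from
    'newMemberCount' => 12,                        // e.parameter in doGet(e)
]));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    // Apps Script responds via redirect
$result = curl_exec($ch);
curl_close($ch);
```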
I want to extract the table data from images or scanned documents and map the header fields to their particular values, mostly in insurance documents. I have tried extracting them line by line and then mapping them using their position on the page. I gave the table a boundary by defining start and end pivots, but it doesn't give proper results, since headers sometimes span multiple lines (I implemented this in PHP). I also want to know whether I can use machine learning to achieve the same.
For PDF documents I have used tabula-java, which worked pretty well for me. Is there a similar kind of implementation for images as well?
Insurance_Image
The documents would be of similar type as in the link above but of different service providers so a generic method of extracting such data would be very useful.
In the image above I want to map values like Make = YAMAHA, Model = FZ-S, CC = 153, etc.
Thanks.
I would definitely give Tesseract a go; it is a very good OCR engine. I have been using it successfully to read all sorts of documents embedded in emails (PDFs, images), and a colleague of mine used it for something very similar to your use case: reading specific fields from invoices.
After you parse the document, simply use regex to pick the fields of interest.
I don't think machine learning would be particularly useful for you unless you plan to build your own OCR engine. I'd start with existing libraries; they offer very good performance.
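As a rough illustration of that pipeline (a sketch, not a tested recipe), assuming the tesseract CLI is installed and the field labels appear literally in the scan:

```php
<?php
$image = 'insurance_scan.png';                      // hypothetical input file

// OCR the scan to plain text; tesseract writes its output to out.txt
exec(sprintf('tesseract %s out', escapeshellarg($image)));
$text = file_get_contents('out.txt');

// Pick out the fields of interest with regexes keyed to the printed labels
$fields = [];
foreach (['Make', 'Model', 'CC'] as $label) {
    if (preg_match('/' . $label . '\s*[:=]?\s*([A-Za-z0-9\-]+)/i', $text, $m)) {
        $fields[$label] = $m[1];
    }
}

print_r($fields);   // e.g. Make => YAMAHA, Model => FZ-S, CC => 153
```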
The easiest and most reliable way to do it without much knowledge of OCR would be this:
- Take an empty template for reference and mark the coordinates of the boxes you need to extract data from. Label them and save them for future use. This is done only once per template.
- Now, when reading a filled-in copy of the same template, resize it to match the reference template's dimensions (if they don't already match).
- You already have every box's coordinates and know what data each should contain (because you labeled and saved them in the first step), which means you can now just analyze the pixels contained in each box to know what is written there.
This means that, given the list of labeled boxes extracted in the first step, you should be able to get the data in each one of them. If this data is typed rather than handwritten, it will be easy to analyze, or to do whatever you want with, using simple OCR libraries.
Or, if the data is always the same size and font, as in your example template above, you could just build your own small database of letters of that font and size. Or maybe full words? It depends on each box's possible answers.
Anyway, this is far from the best approach, but it would definitely get the work done with minimal effort and OCR knowledge.
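A sketch of the labeled-boxes idea, assuming PHP's GD extension and the tesseract CLI; the coordinates are placeholders you would measure once on the empty reference template:

```php
<?php
// Step 1 output: each labeled box with its coordinates on the reference template
$boxes = [
    'Make'  => ['x' => 120, 'y' => 340, 'width' => 200, 'height' => 40],
    'Model' => ['x' => 120, 'y' => 390, 'width' => 200, 'height' => 40],
];

$scan = imagecreatefrompng('filled_form.png');
// Step 2 would go here: resize $scan to the reference template's dimensions.

$fields = [];
foreach ($boxes as $label => $rect) {
    // Step 3: crop the pixels belonging to this box and OCR just that region
    $crop = imagecrop($scan, $rect);
    imagepng($crop, 'box.png');
    exec('tesseract box.png box');                 // writes box.txt
    $fields[$label] = trim(file_get_contents('box.txt'));
    imagedestroy($crop);
}

print_r($fields);
```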
This is just a speculative idea for a client who has a lot of PDF files.
Algolia say in their FAQs that to search PDF files you first need to extract the text from the file. How would you go about this?
The way I envisage the system working would be:
- Client uploads a PDF via the CMS
- CMS calls some service / program to extract the text
- Algolia indexes the extracted text, and it's somehow linked to the original PDF
It would need to be an automated system as the client shouldn't have to tell it to index.
It would be built in PHP, probably Laravel, running on Ubuntu.
What software / service could do the text extraction from the PDFs, and is any magic needed to 'link' this with the PDF file?
I'm also happy to have suggestions on other search services which may handle this.
Fortunately, text extraction from PDFs is a subject that has been covered many times. On the command line you could use pdftotext (available on Linux or Mac), or in your code a library such as Apache Tika (for which you can find a PHP wrapper).
To avoid having too much noise in your records, I'd recommend splitting the text and creating one record per paragraph. You can then use Algolia's distinct feature to deduplicate the results.
You should already have the links to your files somewhere; just store them in your records, and then in your front end you'll easily be able to create links to them using, for instance, autocomplete.js or instantsearch.js.
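A rough sketch of that extract, split, and index pipeline, assuming the pdftotext CLI and the official algoliasearch-client-php package; the index name, attributes, and URL are assumptions:

```php
<?php
require 'vendor/autoload.php';

$pdf = 'uploads/report.pdf';
$url = 'https://example.com/files/report.pdf';     // the stored link to the original

// 1. Extract plain text ("-" sends the output to stdout)
$text = shell_exec(sprintf('pdftotext %s -', escapeshellarg($pdf)));

// 2. One record per paragraph, all sharing a "file" attribute; setting that
//    as attributeForDistinct in the index settings deduplicates the results
$records = [];
foreach (preg_split('/\n\s*\n/', trim($text)) as $i => $paragraph) {
    $records[] = [
        'objectID' => md5($pdf) . '-' . $i,
        'content'  => $paragraph,
        'file'     => $url,
    ];
}

// 3. Push the records to Algolia
$client = Algolia\AlgoliaSearch\SearchClient::create('APP_ID', 'ADMIN_API_KEY');
$client->initIndex('documents')->saveObjects($records);
```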
For anyone still looking for a solution, I put together a GitHub repository that does exactly that: https://github.com/PDFTron/pdftron-document-search.
The text extraction happens client-side as the user uploads the document, using React + Firebase + Algolia.
You can check out a quick video walking you through the sample app: https://youtu.be/IQATnzHTp7Q.
Let me know if you have any questions.
Very sorry if this is not in the right place, etc. I have been researching this for a while but it's raised more questions!
I have developed a spreadsheet which I use to set a team's duties for a shift. There are 5 teams, each with staff that can change day to day.
The spreadsheet works fine, but it's too complicated for some users. I am therefore trying to develop a straightforward web-based form.
All the data is in the spreadsheet, held on a network drive (essentially locally).
I need to be able to have several combo/select boxes which get their values from a range of cells in the XLS, along with the ability to output the final selections to an XLS sheet.
Finally, it needs to be able to load the previous day's values on load.
What is the best way of developing a web page for this? Is JavaScript the best option? Can I access a local file with JavaScript?
Thanks in advance
Adrian
The easiest option for you is to use Google web forms. These allow the creation of forms that submit data to a Google spreadsheet, which is essentially an uploaded version of your local spreadsheet and can be downloaded to Excel.
If you want more control and programming: pure JavaScript can't play with files; you need a server side too. JavaScript is not necessary unless you want your app to do some visually fancy stuff. Since you mentioned PHP as a tag of this question, it seems you are a bit familiar with it. The task you have described can be done with PHP as below (a code sketch follows the steps):
- Read the Excel file using an Excel plugin.
- Parse the relevant data using a text-matching function; this may require some regular-expressions knowledge.
- Display the form by building up the HTML and filling in the variables using the data obtained above.
- Write a method to save the data submitted by the form back to the same Excel file using the Excel plugin.
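A minimal sketch of steps 1, 3, and 4, assuming phpoffice/phpspreadsheet (the successor to PHPExcel) as the "Excel plugin"; the file name and cell ranges are placeholders:

```php
<?php
require 'vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\IOFactory;

// Step 1: read the options for a select box from a range of cells
$spreadsheet = IOFactory::load('duties.xlsx');
$sheet = $spreadsheet->getActiveSheet();

$staff = [];
foreach ($sheet->rangeToArray('A2:A20') as $row) {
    if ($row[0] !== null) {
        $staff[] = $row[0];
    }
}

// Step 3: build the select box from that data
echo '<select name="staff">';
foreach ($staff as $name) {
    printf('<option>%s</option>', htmlspecialchars($name));
}
echo '</select>';

// Step 4: write a submitted value back and save the workbook
if (isset($_POST['staff'])) {
    $sheet->setCellValue('B2', $_POST['staff']);
    IOFactory::createWriter($spreadsheet, 'Xlsx')->save('duties.xlsx');
}
```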
That said, it's not convenient to play around with Excel files. A better option would be to generate a CSV file, or to use a database via a database class; CSV files can be parsed easily as text.
I have a database with about 10 tables and they are all interconnected in some way (foreign keys, associative tables).
I want to use all that data to plot markers with info boxes on my instance of Google Maps, and to be able to filter the map.
Judging from the Google Maps articles, you have to use XML with the data from the database to plot the markers and everything else.
But what would be the right way to generate that XML file? Have a huge SQL statement generate one XML file with the entire data from all tables when the web page and map load, or is there a more correct approach?
You in no way have to use XML to place markers on an instance of Google Maps. You could, but you don't have to if it seems difficult. I work a lot with the Google Maps v3 API, and I would recommend you export your data to JSON and either embed it in your document using PHP or make it available for JavaScript to load via Ajax.
Creating interactive Markers from the data is REALLY easy. You just need to iterate over your data, create a Marker object for each point you want on the map, supply some HTML you want displayed in the info window and show that info window on the Marker's click event.
Instead of walking you through with teaspoon accuracy I'll refer you to the Google Maps API v3 beginner tutorial which among other things includes examples of how to create Markers and display them on the map.
Fun fact: you can control which icon is displayed for each marker (you can supply a URL to any image you want), as well as make them bounce. To summarize, you have far more control using JavaScript than if you went with XML.
Regarding performance, I would heed cillosis' advice and cache your MySQL data in whichever format you end up choosing. If you go with JSON, you can cache the result of that as well: simply save the output in a file called something like "mysql-export-1335797013.json". That number is a Unix timestamp from which you can work out when the data needs to be refreshed.
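A sketch of that timestamped cache in PHP; the table, columns, and one-hour lifetime are assumptions for the example:

```php
<?php
$cacheDir = __DIR__ . '/cache';
$maxAge   = 3600;                                   // refresh after one hour

// Find the newest export and read the timestamp embedded in its name
$files  = glob($cacheDir . '/mysql-export-*.json');
$latest = $files ? max($files) : null;

if ($latest && preg_match('/mysql-export-(\d+)\.json$/', $latest, $m)
        && time() - (int) $m[1] < $maxAge) {
    $json = file_get_contents($latest);             // cache hit
} else {
    // Cache miss: query the marker data and export it as JSON
    $pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $rows = $pdo->query('SELECT id, name, lat, lng FROM locations')
                ->fetchAll(PDO::FETCH_ASSOC);
    $json = json_encode($rows);
    file_put_contents($cacheDir . '/mysql-export-' . time() . '.json', $json);
}

// Embed the data for the map's JavaScript to pick up
echo '<script>var markers = ' . $json . ';</script>';
```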
Use SQL the first time to generate the XML for a specific query, then cache that XML output for later use. The very first time it may be slow, but after that the file will already be generated and it will be really fast.
If you want to use XML because PHP and Ajax make it relatively easy, then do; that's why the examples use it. But you are definitely not restricted to XML. JSON is commonly used because it's also easy with PHP, is a smaller download than XML, and is delivered in a form that is directly usable by JavaScript. Or you could use anything else that can be manipulated by your page's JavaScript.
With regard to whether to use one humongous query and data download, you don't have to do that either. You could, but it might be slow: not only running the query but also transferring the data, where caching the query results won't help. Your users would have to wait for the data to arrive and then be manipulated by your JavaScript before it appears on the map.
You might consider doing a fairly simple query to get basic data to display, so the users see something reasonably quickly, and following that up with more queries, perhaps as data is required. There is no point in downloading loads of InfoWindow data if the user is never going to click and see it. In that instance, deliver the marker data and basic InfoWindow data, and only fetch detailed data if the user actually requests it (that is, use a two-stage InfoWindow).
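As one possible shape for that second stage (a sketch, with assumed table and column names), the page would fetch something like details.php?id=42 only when a marker is clicked:

```php
<?php
// details.php: returns the detailed InfoWindow data for a single marker
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$id   = isset($_GET['id']) ? (int) $_GET['id'] : 0;
$stmt = $pdo->prepare('SELECT name, address, description FROM locations WHERE id = ?');
$stmt->execute([$id]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);

header('Content-Type: application/json');
echo json_encode($row ?: []);
```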
We currently use PDForm to grab a blank PDF file (no values, just form fields and text) and list the form fields. We then query our database for the values which match those field names and create a PDF file with the newly populated data, which the user can download from our site. The thing is, PDForm is about $5,000 per machine and we are migrating servers. We want an alternative which is actively supported and recommended by the community.
I know Zend is working on a PDF manipulation extension, but we need something quickly. I have done testing with PDFtk, but the last update for that project was in 2006 and it now seems dead. Being open source it would otherwise be fine, but it seems to cause errors with certain files generated with PDFPenPro (our PDF form creator).
Another solution I thought of was to use iText and write a Java wrapper which accepts command-line input, so that PHP can call it with passthru() or exec(). There are other applications that would work if we completely rewrote our code, but we do not want to do that.
What we need:
- The ability for PHP to receive the PDF form field names.
- PHP to then either create an FDF file (and merge it with the PDF) or send a string to a command-line application which will populate the fields with values from our database.
- The user to then download the newly created PDF file with the populated form fields.
Am I moving in the right direction by creating a Java command-line application that will use iText to parse and create the PDF files specified by PHP, or does anyone know of any cost-effective alternatives?
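For reference, the FDF route from the second point looks roughly like this in plain PHP; the field names are made up, and pdftk stands in for whichever filler you end up with (real code would also escape parentheses and backslashes in the values):

```php
<?php
// Values pulled from the database, keyed by PDF form field name (assumed names)
$fields = ['name' => 'John Doe', 'policy_no' => 'AB-1234'];

// Build a minimal FDF document listing the field/value pairs
$fdf = "%FDF-1.2\n1 0 obj\n<< /FDF << /Fields [\n";
foreach ($fields as $key => $value) {
    $fdf .= sprintf("<< /T (%s) /V (%s) >>\n", $key, $value);
}
$fdf .= "] >> >>\nendobj\ntrailer\n<< /Root 1 0 R >>\n%%EOF";
file_put_contents('data.fdf', $fdf);

// Merge the FDF into the blank form and write the filled PDF
exec('pdftk form.pdf fill_form data.fdf output filled.pdf');
```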
TCPDF seems to have the most robust feature set that I have seen so far.
Thanks, d2burke, for the tip on TCPDF. I'm not trying to do quite as much as the OP, but the software packages available to accomplish any kind of PDF generation are in the $2k to $3k range. TCPDF is PHP-based, open source, and the guy developing it is very supportive.
Always donate to these guys! Where in the world would web development be without them?
So, since none of the above solutions would work (TCPDF doesn't work with forms the way we want, and PDFlib converts the form fields to blocks), we decided to create a command-line wrapper for iText which grabs the form field names from the PDF and then populates them from the database values.
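The PHP side of that wrapper, as a sketch: itext-filler.jar and its list/fill arguments are hypothetical and stand in for whatever interface the Java wrapper around iText actually exposes:

```php
<?php
$template = '/path/to/form.pdf';
$output   = '/path/to/filled.pdf';

// 1. Ask the wrapper for the form field names (assume one per line on stdout)
exec(sprintf('java -jar itext-filler.jar list %s', escapeshellarg($template)), $names);

// 2. Look each value up in the database by field name (assumed schema)
$pdo    = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt   = $pdo->prepare('SELECT value FROM form_data WHERE field = ?');
$values = [];
foreach ($names as $name) {
    $stmt->execute([$name]);
    $values[$name] = (string) $stmt->fetchColumn();
}

// 3. Hand the values back as JSON for the wrapper to fill and save
exec(sprintf(
    'java -jar itext-filler.jar fill %s %s %s',
    escapeshellarg($template),
    escapeshellarg($output),
    escapeshellarg(json_encode($values))
));
```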
I don't know if another product whose license costs range from $1k to $3k can be considered "cost effective", but PDFlib works quite nicely. And if you don't need the PPS functionality, it gets cheaper.