Searching (extracting text) PDF files with Algolia


This is just a speculative idea for a client who has a lot of PDF files.
Algolia say in their FAQs that to search PDF files you first need to extract the text from the file. How would you go about this?
The way I envisage the system working would be:
Client uploads PDF via CMS
CMS calls some service / program to extract the text
Algolia indexes the extracted text and it's somehow linked to the original PDF
It would need to be an automated system as the client shouldn't have to tell it to index.
It would be built in PHP, probably Laravel running on Ubuntu.
What software / service could do the text extraction from the PDFs and is any magic needed to 'link' this with the PDF file?
I'm also happy to have suggestions on other search services which may handle this.

Fortunately, text extraction from PDFs is a subject that has been covered multiple times. On the command line, you could use pdftotext (available on Linux or Mac), or in your code a library such as Apache Tika (for which you can find a PHP wrapper).
To avoid having too much noise in your records, I'd recommend then splitting the text and creating one record per paragraph. You can then use Algolia's distinct feature to deduplicate the results.
You should already have the links to your files somewhere; just store them in your records and then, in your front-end, you'll easily be able to create links to them using, for instance, autocomplete.js or instantsearch.js.
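As a rough illustration of the steps above, here is a minimal PHP sketch, assuming pdftotext is installed on the server. The function names and the docId, url and content fields are made up for the example; only objectID is a name Algolia itself expects. Sending the records with the Algolia PHP client is left out, and the sketch assumes you would set attributeForDistinct to docId in the index settings so that each PDF shows up once.

```php
<?php
// Sketch: turn extracted PDF text into one search record per paragraph.
// Function names and record fields (other than objectID) are illustrative.

/** Run pdftotext and return the text; "-" sends the output to stdout. */
function extractPdfText(string $pdfPath): string
{
    $cmd = 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
    $text = shell_exec($cmd);
    if (!is_string($text)) {
        throw new RuntimeException("pdftotext failed for $pdfPath");
    }
    return $text;
}

/** Split text on blank lines and build one record per non-empty paragraph. */
function buildParagraphRecords(string $text, string $docId, string $pdfUrl): array
{
    $paragraphs = preg_split('/\R\s*\R/', trim($text));
    $records = [];
    foreach ($paragraphs as $i => $paragraph) {
        // Collapse line wraps and repeated spaces inside the paragraph.
        $paragraph = trim(preg_replace('/\s+/', ' ', $paragraph));
        if ($paragraph === '') {
            continue;
        }
        $records[] = [
            'objectID' => $docId . '-' . $i, // unique per paragraph
            'docId'    => $docId,            // shared key, usable for distinct
            'url'      => $pdfUrl,           // link back to the original PDF
            'content'  => $paragraph,
        ];
    }
    return $records;
}
```

In a Laravel app this could run from a queued job triggered by the upload, so the client never has to start the indexing by hand.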

For anyone still looking for a solution, I put together a GitHub repository that does exactly that: https://github.com/PDFTron/pdftron-document-search.
The text extraction happens client-side as the user uploads the document using React + Firebase + Algolia.
You can check out a quick video walking you through the sample app: https://youtu.be/IQATnzHTp7Q.
Let me know if you have any questions.

Related

Generating multi-page PDF order with PHP, JS or JQuery?

I am working on developing a shipping/receiving system which I plan to set up on an intranet. I have experience with HTML, CSS, PHP, MySQL, JavaScript, jQuery, and AJAX. My basic goal is to be able to scan barcodes and then generate and save PDFs for printing and storage that can be 1 page or 100 pages. I basically want to create a header with order information such as Order ID, Customer/Vendor, Date, Page Number, etc. with columns below containing information like Part Number, QTY, Description, etc.
I am not sure if the entire pages can be created with CSS and 'foreach variables' or if perhaps a template where text is simply placed on top of a default PDF would work best? In the past I have been able to take a basic template and enter text onto a single-page PDF at a specific X and Y coordinate, but I am not sure about detecting page breaks and such.
Any advice on where to begin would be greatly appreciated!
Thanks in advance :)

Take a look at FPDF, it's quite a powerful PHP library for creating PDFs. I've used it quite a lot and I'm sure it would fit your needs.
I don't think it's necessary to have a template upon which you position text. You could, for example, use the table capability provided by third-party scripts for FPDF to present your data.

With a solution such as wkhtmltopdf, an HTML-to-PDF converter, you can generate the HTML for the invoice using PHP and output the PDF file using the tool. On an Ubuntu box you can simply install it using sudo apt-get install wkhtmltopdf. wkhtmltopdf is a command-line tool, so you might have a cron job running in the background which picks up HTML files within a folder and converts them to PDF, or you could use PHP's exec() or system() functions to execute the program. Hope that helps.
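A minimal sketch of that flow in PHP might look like the following. The $order array layout and both function names are assumptions made for the example; the only external piece is the wkhtmltopdf command itself, called as wkhtmltopdf input.html output.pdf.

```php
<?php
// Sketch: build an order page as HTML, then hand it to wkhtmltopdf.
// The $order array layout and function names are illustrative assumptions.

function renderOrderHtml(array $order): string
{
    // Escape every value so part descriptions cannot break the markup.
    $e = fn($v) => htmlspecialchars((string) $v, ENT_QUOTES, 'UTF-8');
    $rows = '';
    foreach ($order['items'] as $item) {
        $rows .= '<tr><td>' . $e($item['part']) . '</td><td>' . $e($item['qty'])
               . '</td><td>' . $e($item['description']) . "</td></tr>\n";
    }
    return "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"></head><body>\n"
         . '<h1>Order ' . $e($order['id']) . ' for ' . $e($order['customer']) . "</h1>\n"
         . "<table>\n<thead><tr><th>Part Number</th><th>QTY</th><th>Description</th></tr></thead>\n"
         . "<tbody>\n" . $rows . "</tbody>\n</table>\n</body></html>\n";
}

function buildWkhtmltopdfCommand(string $htmlPath, string $pdfPath): string
{
    return 'wkhtmltopdf ' . escapeshellarg($htmlPath) . ' ' . escapeshellarg($pdfPath);
}

// Usage (requires wkhtmltopdf on the server):
// file_put_contents('/tmp/order.html', renderOrderHtml($order));
// exec(buildWkhtmltopdfCommand('/tmp/order.html', '/tmp/order.pdf'), $output, $status);
// $status === 0 indicates the conversion succeeded.
```

Because the table is plain HTML, the converter decides where pages break, so the same code covers a 1-page and a 100-page order.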

Get PDF output from XML generated by a PHP file and translated with an XSLT

I've spent a couple of days thinking about the best practice for generating a PDF whose layout end users can customize themselves. The PDF output needs to be saved on the server or sent back to the PHP file so the PHP file can save it, and the PHP file needs to know that it went OK.
I thought the best way to do this was to use XML, XSLT and Apache Cocoon. But I'm not sure if this is possible or if it's a good idea, since I can't find any information about people doing anything similar. It cannot be an uncommon problem.
The idea came when I read about Cocoon converting XML through XSLT to PDF:
http://cocoon.apache.org/2.1/howto/howto-html-pdf-publishing.html
and being able to take in variables:
http://old.nabble.com/how-to-access-post-parameters-from-sitemap-td31478752.html
This is what I had in mind:
A php file gets called by a user, the php file generates a source XML file with a specific name
The php file then makes a request to Cocoon (on the same web server) to apply the user defined XSLT on the XML file. A parameter will be needed here to know which XSLT to apply.
The request is handled by the PHP file and then saved as a PDF on the server, and can later be mailed away.
Will this work at all? Is there a better way to handle this?
The core problem is that the users need to be able to customize the layout on the PDFs themselves, and I need the server to save the PDF and to mail it later on. The users will use it for order confirmations, invoices, etc. And I wouldn't like to hard code the layout for each user.

I've had some good results in the past by setting up JasperReports Server and creating reports using iReport Designer. They're both available in F/OSS ("community") editions, though you can pay for support and value-adds if you need those things.
This was a good solution for us, since we could access it via the Java API for our Java system, and via SOAP for our PHP system. The GUI designer made tweaking reports very easy for non-technical business staff too.

I use webkithtml2pdf to generate my PDFs. Just create a document with HTML and CSS for printing like you would usually do, then run it through the converter.
It works great for generating things like invoices. You can use SVG for logos and illustrations, and they will look great in print since they are vector based. Even rounded corners with dotted outlines work perfectly.
A minor gotcha is that the input HTML file must have the .htm or .html file name suffix, so you can't use the default tempfile functions.
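One way around that gotcha, sketched in PHP: build the temp file name yourself so it ends in .html. The function name and the invoice_ prefix are made up for the example.

```php
<?php
// Sketch: write HTML to a uniquely named temp file ending in ".html".
// tempnam() lets you choose a prefix but not a suffix, hence the manual name.

function writeTempHtml(string $html): string
{
    $path = sys_get_temp_dir() . DIRECTORY_SEPARATOR
          . 'invoice_' . bin2hex(random_bytes(8)) . '.html';
    // 'x' mode fails if the file already exists, so a name is never reused.
    $handle = fopen($path, 'x');
    if ($handle === false) {
        throw new RuntimeException("Could not create $path");
    }
    fwrite($handle, $html);
    fclose($handle);
    return $path;
}
```

Remember to unlink() the file once the converter has finished with it.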

Modify PDF Forms from a PHP site

We currently use PDForm to grab a blank pdf file (no values, just form fields and text) and list the form fields. We then query our database for the values which match those field names and create a pdf file with the newly populated data which the user can download from our site. The thing is PDForm is about $5,000 per machine and we are migrating servers. We want an alternative which is actively supported and recommended by the community.
I know Zend is working on a PDF manipulation extension, but we need something quick. I have done testing with PDFtk but the last update for that project was in 2006 and it now seems dead. It would be fine as it is open source, however it seems to be causing errors with certain files that seem to be generated with PDFPenPro (our pdf form creator).
Another solution I thought up was why not just use iText and write a java wrapper which accepts command line input, so that PHP can call it with passthru() or exec(). There are other applications that will work should we completely rewrite our code but we do not want to do that.
What we need.
The ability for PHP to receive the PDF form field names.
PHP to then either create an FDF file (then merge it with the PDF) or send a string to a command line application which will populate the fields with values from our database.
The user can then download the newly created PDF file with the populated form fields.
Am I moving in the right direction by creating a Java command line application that will use iText to parse and create the PDF files specified by PHP, or does anyone know of any cost-effective alternatives?

TCPDF seems to have the most robust feature set that I have seen so far.

Thanks, d2burke, for the tip on TCPDF. I'm not trying to do quite as much as the OP, but the software packages available to accomplish any kind of pdf generation are in the $2k to $3k range. TCPDF is php based, open source and the guy developing it is very supportive.
Always donate to these guys! Where in the world would web development be without it?

None of the above solutions would work: TCPDF doesn't work with forms the way we want, and PDFlib converts the form fields to blocks. So we decided to create a command line wrapper for iText which will grab the form field names from the PDF and then populate them based on the database values.
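For comparison, the FDF route mentioned in the question can be sketched in plain PHP. This is only a minimal sketch: the function names are made up, and it writes values as plain PDF literal strings, so it assumes ASCII field values (other text would need additional encoding). The resulting file still has to be merged into the PDF by whichever tool you settle on.

```php
<?php
// Sketch: build an FDF document from field name => value pairs.
// Values are written as PDF literal strings, so \, ( and ) are escaped.

function fdfEscape(string $value): string
{
    return strtr($value, ['\\' => '\\\\', '(' => '\\(', ')' => '\\)']);
}

function buildFdf(array $fields): string
{
    $entries = '';
    foreach ($fields as $name => $value) {
        // /T is the field name, /V is the value to fill in.
        $entries .= '<< /T (' . fdfEscape((string) $name) . ') /V ('
                  . fdfEscape((string) $value) . ") >>\n";
    }
    return "%FDF-1.2\n1 0 obj\n<< /FDF << /Fields [\n" . $entries
         . "] >> >>\nendobj\ntrailer\n<< /Root 1 0 R >>\n%%EOF\n";
}

// Usage: file_put_contents('/tmp/data.fdf', buildFdf($valuesFromDatabase));
```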

I don't know if another product whose license costs range from $1k to $3k could be considered "cost effective", but PDFlib works quite nicely. And if you don't need the PPS functionality, it does get cheaper.

How would the conversion of a custom CMS using a text-file-based database to Drupal be tackled?

Just today I've started using Drupal for a site I'm designing/developing. For my own site http://jwm-art.net I wrote a user-unfriendly CMS in PHP. My brief experience with Drupal is making me want to convert from the CMS I wrote. A CMS whose sole method (other than comments) of automatically publishing content is by logging in via SSH and using NANO to create a plain text file in a format like so*:
head<<END_HEAD
title = Audio
keywords= open,source,audio,sequencing,sampling,synthesis
descr = Music, noise, and audio, created by James W. Morris.
parent = home
END_HEAD
main<<END_MAIN
text<<END_TEXT
Digital music, noise, and audio made exclusively with
#=xlink=http://www.linux-sound.org#:Linux Audio Software#_=#.
END_TEXT
image=gfb#--#;Accompanying image for penonpaper-c#right
ilink=audio_2008
br=
ilink=audio_2007
br=
ilink=audio_2006
END_MAIN
info=text<<END_TEXT
I've been making PC based music since the early nineties -
fortunately most of it only exists as tape recordings.
END_TEXT
( http://jwm-art.net/dark.php?p=audio - There's just over 400 pages on there. )
*The journal-entry form, which takes some of the work out of it, has mysteriously broken. And it still required SSH access to copy the file to the main dat dir and to check I had actually remembered the format correctly and the code hadn't mis-formatted anything (which it always does).
I don't want to drop all the old content (just some), but how much work would be involved in converting it, factoring into account I've been using Drupal for a day, have not written any PHP for a couple of years, and have zero knowledge of SQL?
How would I map the abstraction in the text file above so that a user can select these elements in the page-publishing mechanism to create a page?
How might a team of developers tackle this? How do-able is it for one guy in his spare time?

You would parse the text with PHP and use the Drupal API to save it as a node object.
http://api.drupal.org/api/function/node_save
See this similar issue, programmatically creating Drupal nodes:
recipe for adding Drupal node records
Drupal 5: CCK fields in custom content type
Essentially, you create the $node object and assign values. node_save($node) will do the rest of the work for you, a Drupal function that creates the content record and lets other modules add data if need be.
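As a rough sketch of that idea, the following PHP reads only the head block of the format shown in the question and maps it onto a node object. The function names, the 'page' content type and the field mapping are assumptions; the main and info blocks would need their own parsing, and node_save() is only available inside a bootstrapped Drupal site.

```php
<?php
// Sketch: read the "head" block of the legacy page format into an array,
// then map it onto a node object. The field mapping is an assumption.

function parseHeadBlock(string $source): array
{
    // Matches head<<END_HEAD ... END_HEAD, whatever the terminator is called.
    if (!preg_match('/^head<<(\w+)\R(.*?)\R\1\s*$/ms', $source, $m)) {
        return [];
    }
    $fields = [];
    foreach (preg_split('/\R/', $m[2]) as $line) {
        if (strpos($line, '=') === false) {
            continue;
        }
        [$key, $value] = explode('=', $line, 2);
        $fields[trim($key)] = trim($value);
    }
    return $fields;
}

function buildNode(array $head, string $body): stdClass
{
    $node = new stdClass();
    $node->type  = 'page';                      // assumed content type
    $node->title = $head['title'] ?? 'Untitled';
    $node->body  = $body;                       // Drupal 5/6 style; Drupal 7 nests body by language
    return $node;
}

// Inside a bootstrapped Drupal site:
// node_save(buildNode(parseHeadBlock($fileContents), $convertedBody));
```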
You could also employ XML RPC services, if that's possible on your setup.

Since you have not written any PHP for a long time, and you are probably in a hurry, I suggest this approach:
Download and install this Drupal module: http://drupal.org/project/node_import
This module imports data (nodes, users, taxonomy entries, etc.) into Drupal from CSV files.
Read its documentation and spend some time learning how to use it.
Convert your blog into CSV files. Unfortunately, I cannot help you much with this, because your blog entries have a complex structure. I think writing code that converts them into CSV files would take about as long as creating the CSV files manually.
Use the Node Import module to import the data into your new website.
Of course, some tasks will remain that you have to do manually, like creating menus.
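If you do script the conversion step, the CSV-writing part is small. Here is a minimal sketch, assuming the pages have already been parsed into arrays; the function name and the column names are illustrative and would need to match whatever mapping you set up in the import module.

```php
<?php
// Sketch: write parsed pages to CSV text for an importer to consume.
// Column names are illustrative; match them to your import mapping.

function pagesToCsv(array $pages, array $columns): string
{
    $handle = fopen('php://memory', 'w+');
    fputcsv($handle, $columns, ',', '"', '\\');   // header row
    foreach ($pages as $page) {
        $row = [];
        foreach ($columns as $column) {
            $row[] = $page[$column] ?? '';        // missing fields become empty cells
        }
        fputcsv($handle, $row, ',', '"', '\\');
    }
    rewind($handle);
    $csv = stream_get_contents($handle);
    fclose($handle);
    return $csv;
}

// Usage: file_put_contents('pages.csv', pagesToCsv($pages, ['title', 'keywords', 'descr']));
```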

Generate ODT documents with dynamic images in PHP

I maintain a couple of web databases based on PHP and mySQL on a shared hosting package.
The databases have a mechanism for the user to upload OpenOffice documents with placeholders:
[person.name] [person.address] [person.postcode]
I then use this great PHP tool to run through the OpenOffice document and insert values from the database into it. The result is again, an OpenOffice document.
What it can't do is dynamic images.
Does anybody know a - preferably PHP-only - solution to insert images into OpenOffice documents?
I know PUNO. Can't use it in this context because it's shared hosting.
I know OpenOffice can be run as a daemon - ditto.
I know phpDocWriter. It was great for SXW files but is dead now.
I know OpenDocument is a collection of XML files in a ZIP file. I once tried to programmatically add a caption to every image in an ODT document. It drove me fricking crazy. I look with admiration upon developers who work with the format, but it's not for me.
I would really appreciate any hints on existing solutions.

I think odtPHP might be what you're looking for
It seems to be able to insert images at a placeholder in the document, and it simply reads from an array to see which image to place.
http://www.odtphp.com/index.php?i=tutorials&p=tutorial5
Now, if you do this as a post-process after your current code, or simply use it instead of TBS, you got everything you need IMHO
Alternatively, you can include a default image with a certain filename in your document, and simply replace that imagefile in the archive.
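The file-replacement idea can be sketched with PHP's ZipArchive class, assuming the zip extension is available on the shared host. The function name is made up, and the entry name you pass in (something like Pictures/placeholder.png) is an assumption: open your own document as a ZIP to see what the image is actually called.

```php
<?php
// Sketch: swap a placeholder image inside an ODT (a ZIP archive).
// The entry name passed in is an assumption; check it against your file.

function replaceOdtImage(string $odtPath, string $entryName, string $newImagePath): void
{
    $zip = new ZipArchive();
    if ($zip->open($odtPath) !== true) {
        throw new RuntimeException("Cannot open $odtPath");
    }
    if ($zip->locateName($entryName) === false) {
        $zip->close();
        throw new RuntimeException("$entryName not found in archive");
    }
    // Adding an entry under an existing name replaces its contents.
    $zip->addFromString($entryName, file_get_contents($newImagePath));
    $zip->close();
}

// Usage: replaceOdtImage('letter.odt', 'Pictures/placeholder.png', 'photo.png');
```

The new image should have the same type and roughly the same proportions as the placeholder, since the document's XML still describes the original frame.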

There is a new version of TbsOOo, it's OpenTBS and it has a feature for inserting/changing a picture in the file.
http://www.tinybutstrong.com/opentbs.php

Did you try to use the AddFileToDoc method to add an image to the document?
The documentation on this method is here:
http://www.tinybutstrong.com/tbsooo.php#AddFileToDoc
