Hi, I know about several PDF generators for PHP (FPDF, dompdf, etc.).
What I want to know about is a parser.
For reasons beyond my control, certain information I need is only available in a table inside a PDF,
and I need to extract that table and convert it to an array.
Any suggestions?
I've written one before (for similar needs), and I can say this: Have fun. It's quite a complex task. The PDF specification is large and unwieldy. There are several methods of storing text inside of it. And the kicker is that each PDF generator is different in how it works. So while something like TFPDF or DOMPDF creates REALLY easy to read PDFs (from a machine standpoint), Acrobat makes some really hellish documents.
The reason is how it writes the text. Most DOM based renderers --that I've used-- write the entire line as one string, and position it once (which is really easy to read). Acrobat tries to be more efficient (and it is) by writing only one or maybe a few characters at a time, and positioning them independently. While this REALLY simplifies rendering, it makes reading MUCH more difficult.
The up side here, is that the PDF format in itself is really simple. You have "objects" that follow a regular syntax. Then you can link them together to generate the content. The specification does a good job at describing the file format. But real world reading is going to take a bit of brain power...
Some helpful pieces of advice that I had to learn the hard way if you're going to write it yourself:
Adobe likes to re-map fonts, so character 65 will likely not be A... You need to find the map object and deduce what it's doing based upon what characters are in there. This is efficient, since if a character doesn't appear in the document for that font, the font doesn't include it (which makes life difficult if you try to programmatically edit a PDF)...
Write it as abstractly as possible. Write classes for each object type and each native type (strings, numbers, etc.), and let those classes do the parsing for you. There will be a fair bit of repetition in there, but you'll save yourself in the end when you realize that you need to tweak something for only one specific type...
Write for a specific version or two of the PDF spec, and enforce it. Check the version number, and if it's higher than you expect, bail... And don't try to "make it work". If you want to support newer versions, break out the specification and upgrade the parser from there. Don't try to trial and error your way up (it's not fun)...
Good luck with compressed streams. I've found that you typically can't trust the length arguments to verify what you are uncompressing. Sometimes (for some generators) it works well... other times it's off by one or more bytes. I just attempt to decompress it if the filter matches, and then force the length...
When testing lengths, don't use strlen. Use mb_strlen($string, '8bit') since it will compensate for different character sets (and allow potentially invalid characters in other charsets).
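To make the last two points concrete, here is a minimal sketch of my own (not taken from any particular parser) of decoding a FlateDecode stream and checking its byte length; the function and variable names are assumptions, and your own object-parsing code would have to supply the raw stream and the declared /Length:

<?php
// Hedged sketch: decode a FlateDecode stream pulled out of a PDF object and
// sanity-check its byte length. $rawStream and $declaredLength are assumed to
// come from your own object/dictionary parsing code.
function decodeFlateStream(string $rawStream, int $declaredLength): ?string
{
    // As noted above, /Length is not always trustworthy, so only warn on mismatch.
    if (mb_strlen($rawStream, '8bit') !== $declaredLength) {
        error_log('Stream length mismatch, forcing actual length');
    }

    // Most FlateDecode streams are zlib-wrapped; fall back to raw deflate data.
    $data = @gzuncompress($rawStream);
    if ($data === false) {
        $data = @gzinflate($rawStream);
    }

    return $data === false ? null : $data;
}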
Otherwise, best of luck...
I use PDFBox for that (http://pdfbox.apache.org/). This software is Java-based and platform independent. It works fast and reliably. You can use it via exec or shell_exec, or via a PHP/Java bridge (http://php-java-bridge.sourceforge.net/).
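A rough sketch of the exec route (the jar path and output file names are assumptions; adjust them to your installation):

// Call the PDFBox command-line app from PHP and read the extracted text back in.
exec('java -jar /opt/pdfbox/pdfbox-app.jar ExtractText '
    . escapeshellarg('input.pdf') . ' ' . escapeshellarg('output.txt'), $out, $code);
$text = $code === 0 ? file_get_contents('output.txt') : null;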
Have you already looked at Xpdf? There is a program in there called pdftotext that will do the conversion. You can call it from PHP and then read in the text version of the PDF. You will need the ability to run exec() or system() from PHP, so this may not work on all hosted solutions.
Also, there are some examples on the PHP site that will convert PDF to text, although they're pretty rough. You may want to try some of those examples as well. On that PHP page, search for luc at phpt dot org.
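Since the goal above is to end up with an array, a hedged sketch of the pdftotext route might look like this (the -layout flag keeps the table columns roughly aligned; splitting on runs of whitespace is a heuristic you would have to tune for your particular PDF):

<?php
// Sketch: extract text with pdftotext, then split each line into columns on
// runs of two or more spaces. Heuristic only; file names are placeholders.
exec('pdftotext -layout ' . escapeshellarg('report.pdf') . ' ' . escapeshellarg('report.txt'));

$rows = [];
foreach (file('report.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $rows[] = preg_split('/\s{2,}/', trim($line));
}
print_r($rows); // each entry is one table row as an array of cell strings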
Zend_Pdf is part of the Zend Framework. Their manual states:
The Zend_Pdf component is a PDF (Portable Document Format) manipulation engine. It can load, create, modify and save documents. Thus it can help any PHP application dynamically create PDF documents by modifying existing documents or generating new ones from scratch.
Have a look at Ghostscript or iTextSharp; there are various cross-platform versions of both.
It may not actually be a table inside the PDF, as PDF loses that sort of information...
This is PHP PDF parser, which exists in two flavours:
Free version can parse PDFs up to format PDF 1.5
Commercial add-on can parse any PDF format (up to current 1.9)
Right now I have been given the task of building a contract letter generation feature in an HRMS.
I'm already using CKEditor, but the result is very different, since CKEditor isn't meant to be used the way Microsoft Word or Google Docs is.
So my idea is to create the template in Microsoft Word first and use PHP's str_replace function to insert the data into the Word template.
The questions are:
1. With that flow, is it possible to do that?
2. If it is, can you show me a sample?
Many Thanks,
Hendra
There are several classes/libraries that can do at least part of what you are trying to do:
wrklst/docxmustache
openTBS – Tiny But Strong
PHPWord
docxtemplater pro (basic opensource / free version / MIT license available as of writing; image replacing is a commercial plugin)
docxpresso (commercial)
phpdocx (commercial)
The first 4 of these are at least partially open source, and investigating their code will help you understand the process, which is not trivial with Word. In addition, you can check out http://officeopenxml.com for the format details.
The main problem I see is with proper HTML to OOXML conversion, i.e. converting the styling from CKEditor (which might be HTML) into the proper XML styling, which works quite differently; a direct translation is not trivial. Check out https://github.com/wrklst/docxmustache/blob/master/src/WrkLst/DocxMustache/HtmlConversion.php to see some basic HTML conversion for singular runs of bold, italic and underlined text.
To my knowledge there is no maintained open source package that delivers proper html to openxml conversion. If you need this and cannot write it yourself, you will probably go for one of the paid solutions.
Good luck.
Docx is a zipped format that contains some xml. If you want to build a simple replace {tag} by value system, it can already become complicated, because the {tag} is internally separated into <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>. If you want to embed loops to iterate over an array, it becomes a real hassle.
source : https://docxtemplater.readthedocs.io/en/latest/goals.html
You could use the library I created as an answer to this problem: https://github.com/open-xml-templating/docxtemplater. It works with JS in the browser or with Node.js.
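To illustrate the point about the zipped XML, here is a naive sketch of the str_replace idea from the question (not how docxtemplater works internally); the placeholder and file names are assumptions, and it only succeeds when Word has kept the placeholder in a single <w:t> run:

<?php
// Naive sketch: copy the template, open it as a zip, and replace a placeholder
// directly inside word/document.xml.
copy('template.docx', 'contract.docx');

$zip = new ZipArchive();
if ($zip->open('contract.docx') === true) {
    $xml = $zip->getFromName('word/document.xml');
    $xml = str_replace('{EMPLOYEE_NAME}', 'Hendra', $xml);
    $zip->addFromString('word/document.xml', $xml); // replaces the existing entry
    $zip->close();
}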
I'm working on a personal project to view web pages offline. The first idea that I came up with was using file_get_contents to get the contents of a specific URL, but this only gets the HTML and not the assets on that page (CSS, images, JavaScript, etc.). So I had to write regexes to get the stylesheets and images on the page:
$css_pattern = '/\S*\.css"/';
$img_src_pattern = '/src=(?:"|\')?.+\.(?:gif|jpg|png|jpeg)(?:"|\')/';
preg_match_all($css_pattern, $contents, $style_matches);
preg_match_all($img_src_pattern, $contents, $img_matches);
This works, but there are also image links in the CSS. I'm still thinking about how to deal with those.
There are also projects like Ganon (https://code.google.com/p/ganon/) and simple HTML parsers that might make my life easier, but I prefer using regex because I want to learn more about it.
The question is: is there a better way of doing this project? The app will probably have folders in which to save the assets and HTML for each site, and it will probably become unwieldy. I've heard of things like the manifest file in HTML5, but I'm not sure that's possible if you don't own the site. Any ideas? If there's no other way to do this, then maybe you can just help me improve the regex that I have above. I basically have to use str_replace and a foreach to get the stylesheets:
$stylesheets = array();
foreach ($style_matches[0] as $match) {
    $stylesheets[] = str_replace(array('href=', '"', "'"), '', $match);
}
Thanks in advance!
I prefer using regex because I want to learn more about it.
Parsing HTML with regex is possible albeit non-trivial. A good introduction is given in the following paper:
REX: XML Shallow Parsing with Regular Expressions
The regular expressions used in that paper (REX) are not the ones used in PHP (PCRE); however, you should be able to understand it if you're willing to learn, as they are similar.
Following what that paper outlines and writing regular expressions in PHP on your own, with some nice test cases, should be a real training camp for you to dig into regular expressions.
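As a small starting exercise (far simpler than the full REX expressions, and only a sketch), a single PCRE can already split markup into tag and text tokens without building a tree:

<?php
// Shallow tokenizer sketch: comments, then tags, then text runs. Not a
// conforming HTML parser, just a feel for regex-based shallow parsing.
$html = '<p class="x">Hello <b>world</b> &amp; friends</p>';

preg_match_all('/<!--.*?-->|<[^>]*>|[^<]+/s', $html, $matches);

foreach ($matches[0] as $token) {
    echo ($token[0] === '<' ? 'markup: ' : 'text:   '), $token, PHP_EOL;
}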
Next to the regular expressions, you also need to deal with character encodings, which is another field of its own, and then adapt the parser to an encoding (if you do not re-encode before parsing).
If you're looking specifically for an HTML5-compatible parser, it is specified as part of the HTML5 "specification", but you can no longer do it precisely with regular expressions in a sane way (at least as far as I know):
12.2 Parsing HTML documents — HTML Living Standard — Updated ca. daily
8.2 Parsing HTML documents — HTML5 — A vocabulary and associated APIs for HTML and XHTML W3C Candidate Recommendation 17 December 2012
For me that type of parsing looks like a large amount of overhead, but peek into the outline of the HTML5 parser and you get an idea of everything you could take care of for HTML parsing nowadays. It seems like those guys and girls really needed to push in everything they could imagine. The following engines/browsers actually have an HTML5 parser:
Gecko 2
Webkit
Chrome 7 (Webkit)
Opera 11.60 (Ragnarök)
IE10
From personal experience, in the PHP ecosystem there are not many SGML-based / "loose" / low-level / tag-soup HTML parsers. If I were to write one, I would also use regular expressions for string parsing; the REX shallow parsing article has some good discussion. However, I would probably only use such a low-level HTML parser to make arbitrary HTML consumable for DOMDocument or for some other validation/fixing related work, and I wouldn't use it for further parsing/document abstraction. DOMDocument is pretty powerful, especially for gathering links, as you describe above.
For the rest of your question, you'll find all the pieces you need outlined in the various HTTP-related RFCs, so you need to decide on your own which link-resolving algorithm you want to support and how you re-map the static CSS/image/JS files when you save them. You'll normally then rewrite the HTML as well, for which DOMDocument is really handy.
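A hedged sketch of that DOMDocument step (the localPathFor() helper is hypothetical; it stands in for whatever URL-to-local-path mapping you decide on):

<?php
// Collect asset URLs and rewrite them to local paths with DOMDocument.
// localPathFor() is a hypothetical mapping function you would implement yourself.
$html = file_get_contents('https://example.com/');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate real-world tag soup
$doc->loadHTML($html);
libxml_clear_errors();

$assets = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $assets[] = $img->getAttribute('src');
    $img->setAttribute('src', localPathFor($img->getAttribute('src')));
}
foreach ($doc->getElementsByTagName('link') as $link) {
    if (strtolower($link->getAttribute('rel')) === 'stylesheet') {
        $assets[] = $link->getAttribute('href');
        $link->setAttribute('href', localPathFor($link->getAttribute('href')));
    }
}

file_put_contents('page.html', $doc->saveHTML());
// $assets now lists everything you still need to download and store locally.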
You should also store some of the HTTP headers inside the HTML file via the meta element, especially the encoding, unless you re-encode the document (which can be useful for offline reading anyway). Some of the more general Q&A suggestions for HTML authoring apply to a static cache as well.
The HTML5 manifest file is actually something different: the original server would have had to support it, which is likely not the case (or you would need to build a parser for it as well and process it). So if you create a mirror, you might also want to point out all the static resources that can be stored locally for offline usage. That is a nice idea; I have not yet seen it implemented by tools like wget, so it's probably worth playing with a little.
Instead of the HTML5 manifest file, you might also have been thinking of one of the following container formats:
Mozilla Archive Format - MAFF
MIME HTML - MHTML
Webarchive
Another one of these formats/extensions (here: the SingleFile Chrome extension) makes use of the data URI scheme, according to Wikipedia, which might also be useful in this context, although I would not favor it. I'd say it's better to have an algorithm that can rewrite URLs to the local file system in a reproducible manner, so that you can dump multiple HTML files with the same assets without fetching those assets multiple times.
Is there a native PHP wbxml API that can be used platform-independently? Perhaps a loadable module?
I have seen the pecl implementations but I have not been able to successfully work with the builds on win32 platforms.
I am not an expert, but what I found out there amounted to two options, essentially.
One, the PECL library that you are having trouble with.
Two, I found WBXML encoder and decoder classes in Horde, of all places. They might give you a starting point, and since they are open source, they might meet your needs quite nicely. Here is a link where I found them:
http://phpxref.com/xref/horde/lib/XML/WBXML/index.html
I don't know a huge amount about WBXML, but from what I can gather it's a binary-formatted XML file. I suppose at its simplest you could use the XML modules such as SimpleXML to generate your XML document, output it as a string, and then use PHP's built-in file handling functions (fopen, fwrite, etc.) to dump the string as binary data to a file. To reverse the process, load the file as a string and have SimpleXML parse it.
However, without knowing the specific details of the WBXML format, I'm sure there's more to it than that. You'd also have to implement the necessary code yourself, but as you could implement it in PHP itself, that should make cross-platform portability a bit simpler to accomplish.
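For what it's worth, a sketch of that rough idea, and only the plain-XML half of it; the actual WBXML tokenization would still have to be written or borrowed (e.g. from the Horde classes mentioned above), and the element names here are just placeholders:

<?php
// Build the document with SimpleXML, dump it with the file functions, read it back.
// This writes ordinary textual XML, not WBXML; it is only the scaffolding.
$xml = new SimpleXMLElement('<SyncML/>');
$xml->addChild('SyncHdr')->addChild('VerDTD', '1.2');

$fp = fopen('payload.xml', 'wb');
fwrite($fp, $xml->asXML());
fclose($fp);

$parsed = simplexml_load_string(file_get_contents('payload.xml'));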
Not really an answer as such, I'm afraid, but I hope it gets you going in the right direction.
Is there an implementation of Lex and Yacc in PHP?
If not, can anyone suggest a lexical analyser and parser generator (i.e., anything like Lex and Yacc) that will create PHP code? I'm not too worried about the performance of the resulting parser.
I am sick of using regex to parse things that really shouldn't be parsed with regex...
There's JLexPHP: https://github.com/wez/JLexPHP/blob/master/jlex.php
I've not used it, but there's this: http://pear.php.net/package/PHP_ParserGenerator , which creates a PHP Parser from a Lemon grammar. The project seems to be inactive though.
I also found this project: http://code.google.com/p/antlrphpruntime/ , which uses Antlr. Again inactive though.
I've been looking for this kind of thing for a while. After finding this post, I tried the ANTLR PHP runtime. I can report that it's far from finished. There are several errors in the generated code, where the original Java runtime classes have not been properly translated to PHP (nested class declarations, using '.' instead of the '->' operator when trying to access class methods).
The ANTLR framework itself is quite powerful (can't attest to the efficiency of the generated code).
Especially the graphical tool ANTLRWorks makes it easy to create and debug grammars. It's just too bad about the PHP version. It's possible to roll your own, though. The best solution may be to analyse the generated ANTLR runtime class, figure out how it works, and come up with a lightweight, less enterprisey version of it.
Cheap trick: code a recursive descent parser. This will cover a lot of cases. See
Is there an alternative for flex/bison that is usable on 8-bit embedded systems?
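To show how cheap the trick can be, here is a minimal sketch of a hand-rolled recursive descent parser for a toy arithmetic grammar (no error handling, purely illustrative):

<?php
// Grammar: expr := term (('+'|'-') term)*
//          term := factor (('*'|'/') factor)*
//          factor := NUMBER | '(' expr ')'
class ExprParser
{
    private array $tokens = [];
    private int $pos = 0;

    public function parse(string $input): float
    {
        preg_match_all('/\d+(?:\.\d+)?|[+\-*\/()]/', $input, $m);
        $this->tokens = $m[0];
        $this->pos = 0;
        return $this->expr();
    }

    private function expr(): float
    {
        $value = $this->term();
        while (in_array($this->peek(), ['+', '-'], true)) {
            $value = $this->next() === '+' ? $value + $this->term() : $value - $this->term();
        }
        return $value;
    }

    private function term(): float
    {
        $value = $this->factor();
        while (in_array($this->peek(), ['*', '/'], true)) {
            $value = $this->next() === '*' ? $value * $this->factor() : $value / $this->factor();
        }
        return $value;
    }

    private function factor(): float
    {
        if ($this->peek() === '(') {
            $this->next();               // consume '('
            $value = $this->expr();
            $this->next();               // consume ')'
            return $value;
        }
        return (float) $this->next();    // NUMBER
    }

    private function peek(): ?string { return $this->tokens[$this->pos] ?? null; }
    private function next(): ?string { return $this->tokens[$this->pos++] ?? null; }
}

// echo (new ExprParser())->parse('2 * (3 + 4)'); // 14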
Another suggestion: avoid the Lex/Yacc approach and use PHP as a good string parser.
For simple tasks and simple translators: use Perl-compatible regular expressions (PCRE) with PHP's preg_* functions. The callbacks have the same power as Awk or Yacc rules, but with PHP code (!).
For complex tasks: translate your language (with a PHP string/PCRE translator or another translator) into an XML dialect and process it with DOM and/or XSLT. XSLT is "rule oriented" (see xsl:template), like Yacc. With XSLT you also have access to PHP functions via registerPHPFunctions(). If you need to go back to a non-XML language or a complex I/O format, process the output (a saved XML file or the XSLT output) again with PCRE and string functions.
PS: for richer and more complex languages, the "translation to XML" task is possible (see the XSugar theory), but not always easy. You can use PHP-PEG to translate with PHP, or you can translate with an external tool, in order to cache the XML or to use a permanently XML-translated version of your specific-language scripts.
These two options have the same power as Lex and Yacc, and use only built-in PHP classes and functions.
For the complex cases, remember that XML, XSLT, etc. are W3C standards; XML dialects are therefore "standard formats", XML tools are optimized and still evolving, and XML data is interchangeable.
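As an illustration of the PCRE-callback point above, a toy sketch where the callback plays the role of a rule body (the placeholder syntax is arbitrary):

<?php
// Toy "translator": rewrite {{name}} placeholders via a callback, the callback
// doing the work an Awk/Yacc rule body would do.
$vars = ['user' => 'Alice', 'day' => 'Monday'];

echo preg_replace_callback(
    '/\{\{(\w+)\}\}/',
    function (array $m) use ($vars): string {
        return $vars[$m[1]] ?? $m[0];   // leave unknown placeholders untouched
    },
    'Hello {{user}}, see you on {{day}}.'
); // Hello Alice, see you on Monday.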
I use Haxe to generate PHP code. (This means you write your code in the Haxe language and get a bunch of PHP files after compiling.) Today a customer told me that he needs a new feature in an old project made with Haxe. He also told me that he altered some small things in the code for his own needs. Now I first have to port his changes to my Haxe code and then add the new feature, because otherwise his changes will be overwritten the next time I compile the project.
To prevent this from happening again, I am looking for some kind of program that minifies / obfuscates PHP code. The goal is to make the code as unreadable / uneditable as possible.
The ideal tool would run under Linux and could process whole folders and all the files they contain.
Anybody any suggestions?
Why not use the PHP built-in function php_strip_whitespace()?
string php_strip_whitespace ( string $filename )
Returns the PHP source code in filename with PHP comments and whitespace removed. This may be useful for determining the amount of actual code in your scripts compared with the amount of comments. This is similar to using php -w from the commandline.
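Since the question asks about whole folders, a hedged sketch of wrapping it in a directory walk (the paths are placeholders):

<?php
// Strip comments/whitespace from every .php file under $src and write the
// results to a parallel tree under $dst. Paths are placeholders.
$src = '/path/to/project';
$dst = '/path/to/minified';

$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($src, FilesystemIterator::SKIP_DOTS)
);
foreach ($files as $file) {
    if ($file->getExtension() !== 'php') {
        continue;
    }
    $target = $dst . '/' . substr($file->getPathname(), strlen($src) + 1);
    if (!is_dir(dirname($target))) {
        mkdir(dirname($target), 0777, true);
    }
    file_put_contents($target, php_strip_whitespace($file->getPathname()));
}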
I agree with the comment that what you are doing is very underhanded, but after 10 years in this biz I can attest to one thing: half the code you get is so convoluted it might as well have been minified, and function/variable names are so often completely arbitrary that I've edited minified JS and it wasn't much more of a hassle than some unminified code.
I couldn't find any such script/program, most likely because this is kind of against the PHP spirit and a bit underhanded. Nevertheless:
First: PHP isn't whitespace sensitive, so step one is to remove all newlines and whitespace outside of strings.
That would make it difficult to mess with for the average tinkerer; an intermediate programmer would just find-and-replace all ;{} with $1\n or something to that effect.
The next step would be to call get_defined_functions and save that array (the 'user' key in the returned array); you'll need to include all of the files to do this.
If it's OO code, you'll need get_declared_classes as well. Save that array.
Essentially, you need to get the variables, methods, and class instances; you'll have to instantiate each class and call get_object_vars on it, and if you poke around you'll see that you can get a lot of other info too, like constants and class vars.
Then you take those lists, loop through them, create a unique name for each item, and preg_replace or str_replace it in all of the files.
Make sure you do this on a test copy, and see what errors you get.
Though, just to be clear, there is a special place in hell reserved for people who obfuscate for obfuscation's sake.
Check out get_defined_functions and get_declared_classes, and just follow the links around to see what you can do.
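A deliberately naive sketch of that renaming step (file names are placeholders; it ignores dynamic calls, strings, callbacks and everything else that makes this hard in practice):

<?php
// Build a rename map from the user-defined functions and rewrite the sources.
// bootstrap.php is a placeholder assumed to include the whole project.
require_once 'bootstrap.php';

$map = [];
foreach (get_defined_functions()['user'] as $i => $fn) {
    $map[$fn] = 'f' . $i;   // e.g. calculate_salary -> f0
}

foreach (glob('/path/to/project/*.php') as $path) {
    $code = file_get_contents($path);
    // Naive textual replacement; a real tool would work on token_get_all() output.
    $code = str_ireplace(array_keys($map), array_values($map), $code);
    file_put_contents($path, $code);
}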
We use Zend Guard to encode our PHP code for certain clients, but as Parrots said, you need to be sure you own the code. We only encode in certain situations, and only when it's explicit that we retain ownership of the code; otherwise Parrots is right, the client has a right to modify it.
I know of Zend Guard; ExpressionEngine used it to encrypt their trial version's core code. You could always give that a go, although you need to pay for it.
However, while I understand the frustration of having to port his changes, I assume they purchased the code from you? They have the right to modify it. You just have the right to charge them extra to port their changes ;) Imagine if you stopped working for them, how could they ever hire someone else to update the code?
Our PHP Obfuscator does exactly the job of stripping comments and whitespace and scrambling identifiers. It operates across a complete set of PHP files and scrambles symbols consistently across those files, ensuring correct operation even after scrambling.
EDIT 2013: Now encrypts string literals to make them unreadable. Operates under Windows, and on Linux under Wine.
You can try PHP Obfuscator or the bcompiler PHP extension.
I have just found a minify service for PHP. It really looks useful. They say that obfuscation will be available soon. I hope this is true :)
http://customhost.com.ua/php-minify/