I am working on an accounting application. The user will upload a PDF or DOC bank statement into the application. I need to read/parse the document and insert the amount, cheque number, etc. (according to my database structure) into the database.
Please help me achieve this.
PDF is designed for presentation, not for working with the data inside it.
You might get lucky with pdftotext or catdoc.
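For what it's worth, here is a minimal sketch of calling pdftotext from PHP (this assumes pdftotext from poppler-utils is installed; the file path is made up):

<?php
// Extract the raw text from an uploaded PDF using pdftotext (poppler-utils).
$pdf  = escapeshellarg('/uploads/statement.pdf');
$text = shell_exec("pdftotext -layout $pdf -");   // "-" writes the extracted text to stdout

if ($text === null || $text === '') {
    die('Could not extract text from the PDF.');
}

// From here you would apply your own parsing (regexes etc.) to pick out the
// amounts, cheque numbers, and so on before inserting them into the database.
echo $text;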
I've been working on this same issue for over two weeks now, and I have to say it is quite a task. I have had some success finding a PHP class to extract the text, but the problem is that it does not work on every version of the PDF format; it's hit and miss. Writing one yourself will take a while once you get into the encoding and compression issues. Right now I'm actually looking at some Python libraries. It's just too time-consuming for me to write one of these from scratch for now.
Related
I have a big database with roughly 5 lakh (500K) entries, and every entry has a document associated with it (i.e. every id has at least one PDF file). I need a robust method to search for a particular text in those PDF files and, if it is found, return the respective 'id'.
Kindly share some fast, optimized ways to search for text in a PDF using PHP. Any idea will be appreciated.
Note: converting the PDF to text and then searching it is not what I am looking for; obviously, that would take longer.
In one line: I need the best way to search for text in a PDF using PHP.
If this is a one-time task, there is probably no 'fast' solution.
If this is a recurring task,
Extract the text via some tool. (Sorry, I don't know of a tool.)
Store that text in a database table.
Apply a FULLTEXT index to that table.
Now the searching will be fast.
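For example, with MySQL/MariaDB the setup and the search could look roughly like this (table and column names are assumptions):

<?php
// Rough sketch: store the extracted text once, index it, then search with MATCH ... AGAINST.
$pdo = new PDO('mysql:host=localhost;dbname=docs;charset=utf8mb4', 'user', 'pass');

// One-time setup: one row per PDF, with a FULLTEXT index on the extracted text.
$pdo->exec("CREATE TABLE IF NOT EXISTS pdf_text (
    doc_id  INT PRIMARY KEY,
    content MEDIUMTEXT,
    FULLTEXT KEY ft_content (content)
) ENGINE=InnoDB");

// The search itself uses the index, so it stays fast even with 500K documents.
$stmt = $pdo->prepare('SELECT doc_id FROM pdf_text
                       WHERE MATCH(content) AGAINST (:q IN NATURAL LANGUAGE MODE)');
$stmt->execute([':q' => 'text to find']);
$ids = $stmt->fetchAll(PDO::FETCH_COLUMN);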
I myself wrote a website in ReactJS to search for info in PDF files (indexed books), which I indexed using the Apache SOLR search engine.
What I did in React is, in essence:
queryValue = "(" + queryValueTerms.join(" OR ") + ")"
let query = "http://localhost:8983/solr/richText/select?q="
let queryElements = []
if(searchValue){
queryElements.push("text:" + queryValue)
}
...
fetch(query)
.then(res => res.json())
.then((result) =>{
setSearchResults(prepareResults(result.response.docs, result.highlighting))
setTotal(result.response.numFound)
setHasContent(result.response.numFound > 0)
})
Which results in an HTTP call:
http://localhost:8983/solr/richText/select?q=text:(chocolate%20OR%20cake)
Since this is ReactJS and just parts of the code, it is of little direct value to you in terms of PHP, but I just wanted to demonstrate the approach. In PHP you'd be using cURL or something similar, as in the sketch below.
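For illustration only, a rough PHP equivalent of the fetch() call above (the local SOLR core "richText" and the "text" field are specific to my setup):

<?php
// Build the same kind of query the React code builds, then call SOLR over HTTP.
$terms = ['chocolate', 'cake'];
$q     = 'text:(' . implode(' OR ', $terms) . ')';
$url   = 'http://localhost:8983/solr/richText/select?' . http_build_query(['q' => $q, 'wt' => 'json']);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

$result = json_decode($response, true);
echo $result['response']['numFound'] . " documents matched\n";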
The indexing itself I did in a separate service using SolrJ, i.e. I wrote a rather small Java program that uses SOLR's own SolrJ library to add PDF files to the SOLR index.
If you opt for indexing with Java and SolrJ (it was the easiest option for me, and I hadn't done Java in years), here are some useful resources and examples which I collected after extensive searching for my own purposes:
https://solr.apache.org/guide/8_5/using-solrj.html#using-solrj
I basically copied what's here:
https://lucidworks.com/post/indexing-with-solrj/
and tweaked it for my needs.
Tip: since I was very rusty with Java, instead of setting classpaths etc., the quick solution for me was to simply copy ALL the libraries from SOLR's solrj folder (and possibly some other libraries) into my Java project. It may be ugly, but it did the job for me.
I'm trying to develop a PHP script that lets users upload shapefiles to import into a PostGIS database.
First of all, for the conversion part, AFAIK we can use shp2pgsql to convert the shapefile to a PostgreSQL table; I was wondering whether there is another way of doing the conversion, as I would prefer not to use the exec() command.
I would also appreciate any ideas on storing the data in a way that does not require dozens of uniquely named tables.
There seems to be no other way than using PostgreSQL's own binary to convert the shapefile. Although it is not really a bad choice, I would rather not use exec() if there were a native PHP function or an Apache module to do it!
However, it sounds like exec() is the only sane option available, so I'm going to use it.
No hard feelings! :)
As for the last part, that's a different question and should be asked separately. I'm afraid, though, that there is no other way of doing it.
UPDATE: example added
// Run shp2pgsql to generate the SQL, then feed the generated queries to PostgreSQL.
$queries = shell_exec("shp2pgsql -s ".SRID." -c $shpfilpath $tblname")
    or respond(false, "Error parsing the shapefile.");
pg_query($queries) or respond(false, "Query failed!");
SRID is a constant containing the SRID.
$shpfilpath is the path to the desired shapefile.
$tblname is the desired name for the table.
See this blog post about loading shapefiles using the PHP shapefile reader plugin from here: http://www.phpclasses.org/package/1741-PHP-Read-vectorial-data-from-geographic-shape-files.html. The blog post focuses on using PHP on the backend to load data for a Flash app, but you should be able to ignore the Flash part and use the PHP portion for your needs.
Once you have the data loaded from the shapefile, you can convert the geometry to a WKT string and use ST_GeomFromText or other PostGIS functions to store it in the database, as in the sketch below.
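A minimal sketch of that insert step, assuming you already have the WKT and attribute values in hand (table and column names are made up):

<?php
// Insert one feature using the WKT produced by the shapefile reader.
$conn = pg_connect('host=localhost dbname=gis user=postgres');

$wkt  = 'POINT(-71.06 42.36)';   // geometry from the shapefile reader, as WKT
$name = 'Some feature';
$srid = 4326;                    // whatever SRID your data uses

pg_query_params($conn,
    'INSERT INTO features (name, geom) VALUES ($1, ST_GeomFromText($2, $3))',
    [$name, $wkt, $srid]);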
Regarding the unique columns for a shapefile, I've found that to be the most straightforward way to store ad-hoc shapefile attributes and then retrieve that data. However, you could use a "tuple" system, and convert the attributes to strings, then store them in arbitrarily named columns (col1, col2, col3, etc.) if you don't care about attribute names or types.
If you cared about names and types, you could go one step further and store them as a shapefile "schema" in another table.
Write your shp2pgsql command and its parameters in a text editor (e.g. Sublime or Notepad).
Copy, paste, and change the shapefile name for each layer.
Save it as a batch file (.bat); a minimal example of its contents is sketched below.
Open a command window.
Change to the directory where your .bat file is saved.
Run the batch file, and it will execute the command for every shapefile and load them all into the database you defined when writing the commands.
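For illustration, the .bat file just contains one shp2pgsql line per shapefile, roughly like this (file names, SRID, and database details are assumptions):

shp2pgsql -s 4326 -c roads.shp public.roads | psql -U postgres -d mygisdb
shp2pgsql -s 4326 -c parcels.shp public.parcels | psql -U postgres -d mygisdb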
Then use QGIS, go to the PostGIS window, and hit connect.
You are good to go: your shapefiles are now ready and can be added as layers to your map. Make sure the spatial reference matches what it was before running the commands. Does that make sense? I hope that helped; it's the quickest way.
Adding this answer just for the benefit of anyone who is looking for the same thing as the OP and does not want to rely on exec() or external tools.
As of August 2019, you can use PHP Shapefile, a free and open-source PHP library I have been developing and maintaining for a few years, which can read and write any ESRI shapefile and natively convert it from/to WKT and GeoJSON, without any third-party dependency.
Using my library, which gives you WKT to use with the PostGIS ST_GeomFromText() function and an array containing all the data needed to perform a simple INSERT, makes this task trivial, fast, and secure, without having to shell out via exec().
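To give a rough idea of what that looks like, here is a sketch from memory; the exact class and method names may differ between versions, so check the library's documentation:

<?php
// Read every record from the uploaded shapefile and hand its WKT and attributes to PostGIS.
require_once 'vendor/autoload.php';

use Shapefile\ShapefileReader;

$conn      = pg_connect('host=localhost dbname=gis user=postgres');
$shapefile = new ShapefileReader('/path/to/upload.shp');

while ($record = $shapefile->fetchRecord()) {
    $wkt  = $record->getWKT();        // geometry as WKT
    $data = $record->getDataArray();  // DBF attributes as an array

    pg_query_params($conn,
        'INSERT INTO features (attributes, geom) VALUES ($1, ST_GeomFromText($2, 4326))',
        [json_encode($data), $wkt]);
}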
I'm new to PHP and MySQL. I'm facing a problem: I need to search data in a large database, but it takes more than 3 minutes to search for a word, and sometimes the browser times out. I am using the FULLTEXT technique for searching; is there any solution to decrease the search time?
Create an index on the table field you will be searching on. Even though it takes some extra storage, queries against it should return results in much less time.
This doesn't answer your question directly but is a suggestion:
I had the same problem with full text search so I switched to SOLR:
http://lucene.apache.org/solr/
It's a search server based on the Lucene library written in Java. It's used by some of the largest scale websites:
http://wiki.apache.org/solr/PublicServers
So speed and scalability aren't an issue. You don't need to know Java to implement it, however. It offers a REST interface that you can query, and it even gives you the option of returning the search results in PHP array format.
Here's the official tutorial:
https://builds.apache.org/job/Solr-trunk/javadoc/doc-files/tutorial.html
SOLR searches through indexed files, so you need to get your database contents into XML or JSON files. You can use the Data Import Handler extension for that:
http://wiki.apache.org/solr/DataImportHandler
To query the REST interface you can simply use the file_get_contents() PHP function or cURL. Or use the PHP SDK for SOLR:
http://wiki.apache.org/solr/SolPHP
Depends on how big your database is. Adding an index for the field you are searching is the first thing to do.
I have run into the same problem, and adding an index on the field worked great.
I have files I need to convert into a database. These files (I have over 100k of them) are from an old system (generated by a COBOL script). I am now part of the team that migrates data from this system to the new system.
Now, because we have a lot of files to parse (each file is 50 MB to 100 MB), I want to make sure I use the right methods to convert them into SQL statements.
Most of the files have the following format:
#id<tab>name<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n
the address2 is optional and can be empty
or
#id<tab>client<tab>taxid<tab>tagid<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n
These are the two most common line formats (I'd say around 50%); other than these, all the lines look similar but carry different information.
Now, my question is: what is the most efficient way to open these files and parse them correctly?
Honestly, I wouldn't use PHP for this. I'd use awk. With input that's as predictably formatted as this, it'll run faster, and you can output SQL commands which you can also run from the command line.
If you have other reasons why you need to use PHP, you probably want to investigate the fgetcsv() function. Its output is an array which you can turn into your INSERT. One of the first user-provided examples takes CSV and inserts it into MySQL. And this function does let you specify your own delimiter, so tab will be fine; a sketch follows below.
If the id# in the first column is unique in your input data, then you should definitely make it the primary key in MySQL, to save yourself from duplicating data if you have to restart your batch.
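A sketch of that fgetcsv() approach for the first format shown above (the file name, table, and column names are assumptions):

<?php
// Read one tab-delimited line at a time so memory use stays flat even on 100 MB files.
$handle = fopen('export_file.dat', 'r') or die('Could not open file');
$pdo    = new PDO('mysql:host=localhost;dbname=migration', 'user', 'pass');
$stmt   = $pdo->prepare('INSERT INTO clients (id, name, address1, address2, city, state, zip, country)
                         VALUES (?, ?, ?, ?, ?, ?, ?, ?)');

while (($row = fgetcsv($handle, 0, "\t")) !== false) {
    $row[0] = trim($row[0], '#');              // strip the leading '#' from the id field
    $stmt->execute(array_slice($row, 0, 8));   // skip the trailing '#' marker field
}

fclose($handle);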
When I worked on a project where it was necessary to parse huge and complex log files (Apache, firewall, SQL), we got a big performance gain by using preg_match_all() (less than 10% of the time required by explode()/trim()/formatting).
Huge files (>100 MB) are parsed in 2 or 3 minutes on a Core 2 Duo (the drawback is that memory consumption is very high, since it creates a giant array with all the information ready to be processed).
Regular expressions also let you identify the content of a line when there are variations within the same file; a rough sketch follows.
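For the first line format described in the question, the idea looks roughly like this (the pattern is an assumption; adapt it to the real variations):

<?php
// Load the whole file and pull every matching line out in one pass.
// Note the memory trade-off mentioned above: everything ends up in one big array.
$content = file_get_contents('export_file.dat');

$pattern = '/^#(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]*)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t#$/m';
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] = id, $m[2] = name, $m[3] = address1, $m[4] = address2 (may be empty),
    // $m[5] = city, $m[6] = state, $m[7] = zip, $m[8] = country
}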
But if your files are simple, try ghoti's suggestion (fgetcsv); it will work fine.
If you're already familiar with PHP, then it is a perfectly fine tool to use.
If records do not span multiple lines, the best way to guarantee that you won't run out of memory is to process one line at a time.
I'd also suggest looking at the Standard PHP Library. It has nice directory iterators and file objects that make working with files and directories a bit nicer (in my opinion) than it used to be.
If you can use the CSV features and you use the SPL, make sure to set the options correctly for the tab delimiter.
You can use trim() to remove the # from the first and last fields easily enough after the call to fgetcsv(); see the sketch below.
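A small sketch of that SPL variant (the file name is an assumption):

<?php
// SplFileObject in CSV mode with a tab delimiter, iterating one row at a time.
$file = new SplFileObject('export_file.dat');
$file->setFlags(SplFileObject::READ_CSV | SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY);
$file->setCsvControl("\t");

foreach ($file as $row) {
    $row[0] = trim($row[0], '#');   // strip the leading '#' from the first field
    // ... map $row onto your INSERT statement ...
}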
Just sit and parse.
It's a one-time operation, and looking for the most efficient way makes no sense.
A more or less sane approach would be enough.
As a matter of fact, you'll most likely waste more time overall looking for the super-extra-best solution. Say your code will run for an hour. You then spend another hour finding a solution that runs 30% faster. You'll have spent 1.7 hours vs. 1.
First off, I am fairly new to PHP (about three weeks of working with it) and am loving it so far. It's a fantastic language.
I am running into an issue, though. I have a client who wants the information collected in a form on his website to be imported into an Excel document that he has already created. Since I am fairly new, I've been Googling for the past two hours and have come up with so many different answers that my head is spinning.
I was wondering if someone could tell me, first, whether this can be done, and second, what the best method is.
Or, if you know of a website that has already explained this in a simple way, could you direct me to it?
Thanks guys, hopefully someday I can be as smart as you :)
Peace
To start with, you're going to need a library capable of reading your Excel template, such as PHPExcel, which you can then populate with the data from the form and save.
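A minimal sketch with PHPExcel (the template path, cell addresses, and form field names are assumptions):

<?php
// Load the client's existing workbook, write the form data into it, and save a copy.
require_once 'PHPExcel.php';

$excel = PHPExcel_IOFactory::load('template.xlsx');
$sheet = $excel->getActiveSheet();

$sheet->setCellValue('A2', $_POST['name']);
$sheet->setCellValue('B2', $_POST['email']);

$writer = PHPExcel_IOFactory::createWriter($excel, 'Excel2007');
$writer->save('filled-in.xlsx');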
Hey Chris, it sounds like what you need to do is call a COM object (Excel automation) from PHP. I Googled calling COM objects from PHP and found a site that suggested something like the code below. The sample is for Word, but it should be simple to translate the idea to Excel. As discussed below, can you assume Windows? Is this code running against Excel on the browser machine, or against some Excel data on the server?
<?php
// COM automation example (Windows only): this drives Word, but Excel works the same way.
$word = new COM("word.application") or die("Cannot start Word for you");
print "Loaded Word version ($word->Version)\n";
$word->Visible = 1;
$word->Documents->Add();
$word->Selection->TypeText("This is a test");
$word->Documents[1]->SaveAs("burb ofzo.doc");
$word->Quit();
?>
Here is a link to using COM from PHP:
PHP: COM
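Translated to Excel, the same COM idea would look roughly like this (Windows with Excel installed is assumed; paths and cell addresses are made up):

<?php
// Drive Excel through COM: open a workbook, write a value, and save it under a new name.
$excel = new COM("excel.application") or die("Cannot start Excel for you");
$excel->Visible = 0;

$workbook = $excel->Workbooks->Open("C:\\templates\\report.xlsx");
$sheet    = $workbook->Worksheets(1);

$sheet->Cells(2, 1)->Value = "Value collected from the form";

$workbook->SaveAs("C:\\output\\report-filled.xlsx");
$excel->Quit();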
Saving the data to a .csv (comma-delimited) file may be a quick option, if you are able to rearrange the spreadsheet at all.
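For example, appending each submission as a row in a CSV file that Excel can open directly (the file and field names are assumptions):

<?php
$fp = fopen('submissions.csv', 'a');
fputcsv($fp, [$_POST['name'], $_POST['email'], date('Y-m-d H:i:s')]);
fclose($fp);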
Our answer is to save them as-is in the database and convert them to <br> only when displaying on a web page. That way, if a non-web program looks at the data, it will display semi-correctly without change. Yes, some things won't always display properly for the program, but they will for the web page.
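In PHP, that conversion at display time can be as simple as the following (the column name is made up):

<?php
// Stored as-is in the database; converted to <br> only when rendering the web page.
echo nl2br(htmlspecialchars($row['comments']));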